U1 - NLP Complete
School of Computing
2. Make students understand the concepts of morphology, syntax, semantics and pragmatics of language, and enable them to give appropriate examples that illustrate each of these concepts.
3. Teach them to recognize the significance of pragmatics for natural language understanding.
4. Enable students to describe applications based on natural language processing and to identify the points of syntactic, semantic and pragmatic processing in them.
5. Help them understand natural language processing and learn how to apply basic algorithms in this field.
CO2: Analyze approaches to syntax and semantic parsing with pronoun resolution.
CO3: Implement semantic roles, relations and frames, including coreference resolution.
CO5: Apply the knowledge of the various levels of analysis involved in NLP and implement them.
• According to industry estimates, only 21% of the available data is in structured form.
• Data is constantly being generated as we send messages on WhatsApp, Facebook and other social media.
• The majority of data exists in textual format, which is highly unstructured.
• In order to produce significant and actionable insights from this data, it is important to get acquainted with the techniques of text analysis and natural language processing.
Need for NLP
1. User-Friendly Interfaces: NLP allows for intuitive and user-friendly interfaces using natural language,
reducing the need for complex programming syntax.
2. Accessibility and Inclusivity: NLP makes technology accessible to a wider audience, including those with
limited technical expertise or disabilities.
3. Conversational Systems: NLP enables the development of conversational agents, enhancing user interaction
and system efficiency.
4. Data Extraction and Analysis: NLP extracts insights from unstructured text data, enabling sentiment
analysis, information retrieval, and text summarization.
5. Voice-based Interaction: NLP powers voice assistants and speech recognition systems for hands-free and
natural interaction.
6. Human-Machine Collaboration: NLP enables seamless communication and collaboration between humans
and machines.
7. Natural Language Understanding: NLP allows machines to comprehend context, semantics, and intent,
enabling advanced applications and personalized experiences.
1. Gmail - when you are typing a sentence in Gmail, you will notice that it tries to auto-complete it. Auto-completion is done using NLP.
2. Spam filters - if email had no spam filters, your inbox would be flooded with unwanted mail. Using NLP we can detect spam by its keywords and filter it out of your inbox.
3. Language translation - translate a sentence from one language to another language.
4. Customer service chatbots - ex: in a bank's service chatbot, you type in a message and many times there is no human on the other end. The chatbot can interpret your language, derive the intent from it, and respond to your question on its own; when it does not work well, it connects you to a human being.
5. Voice assistants such as Amazon Alexa and Google Assistant.
6. Google Search - the BERT language model helps interpret search queries and return relevant results.
Advantages of NLP
• To ask questions about any subject and get a direct response within seconds.
• NLP offers exact answers to a question, meaning it does not return unnecessary and unwanted information.
• NLP helps computers communicate with humans in their own languages.
• It is very time efficient.
• Most companies use NLP to improve the efficiency and accuracy of documentation processes and to identify information in large databases.
[Figure: the stages of NLP analysis - lexical analysis, syntactic analysis, semantic analysis, discourse integration, and pragmatic analysis.]
Syntax Analysis
• Syntax analysis ensures that a given piece of text has a correct structure. It tries to parse the sentence to check its grammatical correctness.
• Given the possible POS tags generated in the previous step, a syntax analyzer checks whether the sentence conforms to the grammar of the language.
Semantic Analysis
• Consider the sentence: “The apple ate a banana”. Although the sentence is
syntactically correct, it doesn’t make sense because apples can’t eat.
• Semantic analysis looks for meaning in the given sentence. It also deals with
combining words into phrases.
• For example, “red apple” provides information regarding one object; hence we
treat it as a single phrase.
• Similarly, we can group names referring to the same category, person, object or
organization. “Robert Hill” refers to the same person and not two separate
names – “Robert” and “Hill”
Pragmatics
• The final stage of NLP, pragmatics, interprets the given text using information from the previous steps. For example, the sentence "Turn off the lights" is an order or request to switch off the lights.
Regular Expressions
■ Regular expression (RE): A formula (in a special language) that is used for specifying simple classes of strings.
■ String: A sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).
■ REs can be used to specify search strings as well as to define a language in a formal way.
■ Search requires a pattern to search for, and a corpus of texts to search through.
■ Search goes through the corpus and returns all texts that contain the pattern.
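A minimal sketch of such a search in Python, using the standard re module (the corpus sentences are illustrative):

    import re

    corpus = [
        "Interesting links to woodchucks and lemurs",
        "Mary Ann stopped by Mona's",
        "Dagmar, my gift please, Dagmar",
    ]

    # Return all texts in the corpus that contain the pattern.
    pattern = r"woodchucks"
    print([text for text in corpus if re.search(pattern, text)])
    # ['Interesting links to woodchucks and lemurs']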
RE Patterns
■ The search string can consist of a single character or a sequence of characters.
■ Regular expressions are case sensitive.
■ The string of characters inside [ ] specifies a disjunction of characters to match: /[wW]oodchuck/ matches woodchuck or Woodchuck.
RE Range
■ How do we conveniently specify any capital letter?
■ Use brackets [ ] with the dash (-) to specify any one character in a range: /[A-Z]/ matches any one capital letter.
■ /[2-5]/ specifies any one of 2, 3, 4, or 5.
RE Negation
■ The caret ^ is used for negation (elsewhere in a pattern it just means a literal ^).
■ When the ^ symbol is the first character after the open square brace [ , the resulting pattern is negated: /[^A-Z]/ matches any character that is not a capital letter.
RE Kleene Star
■ Regular expressions allow repetition of things.
■ Kleene star (*) - zero or more occurrences of the previous character or expression.
■ Kleene *: /baaa*!/ matches baa!, baaa!, baaaa!, ...
■ Kleene plus (+) - one or more occurrences of the previous character.
■ Kleene +: /[0-9]+/ specifies "a sequence of digits".
■ Use the period /./ to specify any character - a wildcard that matches any single character (except a carriage return).
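A small Python sketch of the class, range, negation and repetition operators above (all example strings are made up):

    import re

    # Disjunction inside [ ]: /[wW]oodchuck/ matches woodchuck or Woodchuck.
    print(re.findall(r"[wW]oodchuck", "Woodchuck and woodchuck"))

    # Range and negation: /[0-9]/ any digit; /[^0-9]/ any non-digit.
    print(re.findall(r"[0-9]", "Room 101"))    # ['1', '0', '1']
    print(re.findall(r"[^0-9]", "a1b2"))       # ['a', 'b']

    # Kleene star: zero or more of the previous character.
    print(re.findall(r"baaa*!", "baa! baaa! baaaa!"))   # all three match

    # Kleene plus: one or more -- a sequence of digits.
    print(re.findall(r"[0-9]+", "55 boxes, 255 jars"))  # ['55', '255']

    # The wildcard . matches any single character (except a newline).
    print(re.findall(r"beg.n", "begin began begun"))    # ['begin', 'began', 'begun']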
RE Kleene Star
RE Description
/a*/ Zero or more a’s
/a+/ One or more a’s
/a?/ Zero or one a’s
/cat|dog/ ‘cat’ or ‘dog’
/^cat$/ A line containing only ‘cat’
/\bun\B/ ‘un’ at the beginning of a longer string
RE Anchors, Boundaries
■ The caret ^ matches the start of a line.
■ The dollar sign $ matches the end of a line.
■ Ex: /^The boat\.$/ matches a line that contains only The boat.
■ \b matches a word boundary, while \B matches a non-boundary.
■ Ex: /\b55\b/ matches the string There are 55 bottles of honey but not There are 255 bottles of honey.
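A quick check of anchors and boundaries in Python (test strings are illustrative):

    import re

    # ^ and $ anchor the match to the start and end of the string/line.
    print(bool(re.search(r"^The boat\.$", "The boat.")))      # True
    print(bool(re.search(r"^The boat\.$", "See the boat.")))  # False

    # \b is a word boundary; \B a non-boundary.
    print(re.findall(r"\b55\b", "There are 55 bottles"))    # ['55']
    print(re.findall(r"\b55\b", "There are 255 bottles"))   # []
    print(re.findall(r"\bun\B", "unhappy under union"))     # ['un', 'un', 'un']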
RE Disjunction, Grouping
■ The pipe symbol | is called the disjunction operator.
■ Example: /food|wood/ matches either the string food or the string wood.
■ What is the pattern for matching both the string puppy and puppies?
■ /puppy|ies/ matches the strings puppy and ies, and hence is wrong.
■ The sequence puppy takes precedence over the pipe operator.
■ Use the parentheses ( and ) to make the disjunction ( | ) apply only to a specific pattern.
■ /pupp(y|ies)/ matches the strings puppy and puppies.
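The difference is easy to see in Python (note that re.findall returns the capture group's contents when a pattern contains one, so a non-capturing group (?:...) is used to show the full matches):

    import re

    # Without grouping, | applies to everything on either side:
    print(re.findall(r"puppy|ies", "puppy puppies"))      # ['puppy', 'ies']

    # With grouping, the disjunction applies only inside ( ):
    print(re.findall(r"pupp(?:y|ies)", "puppy puppies"))  # ['puppy', 'puppies']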
RE Operator Precedence
■ The Kleene * operator applies by default only to a single character, not a whole sequence.
■ Ex: Write a pattern to match strings like: Column 1 Column 2 Column 3
■ /Column_[0-9]+_*/ matches the word Column followed by a number and any number of spaces (the _ here stands for a space).
■ The star applies only to the space _ that precedes it, not the whole sequence.
■ /(Column_[0-9]+_*)*/ matches the word Column followed by a number and spaces, with the whole pattern repeated any number of times.
RE Operator Precedence
■ Precedence hierarchy, from highest to lowest:
  1. Parenthesis ( )
  2. Counters * + ? { }
  3. Sequences and anchors
  4. Disjunction |
■ Counters have higher precedence than sequences: /the*/ matches theeeee but not thethe.
■ Sequences have higher precedence than disjunction: /the|any/ matches the or any, but not thany or theny.
RE – A Simple Example
■ Write a RE to match the English article the
■ /the/ – missed ‘The’
■ /[tT]he/ – fixed the case, but included the in ‘others’; we need The or the, not the inside other words, so include word boundaries
■ /\b[tT]he\b/ – missed ‘the25’ and ‘the_’ (in Perl, a word is a sequence of letters, digits and underscores, so \b sees no boundary before a digit or underscore)
■ Make sure there are no alphabetic letters on either side of the:
■ /[^a-zA-Z][tT]he[^a-zA-Z]/ – missed ‘The’ when it begins a line
■ Specify that before the the we require either the beginning of line or a non-alphabetic character, and the same at the end:
■ /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
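A quick test of the final pattern in Python; note that it now also accepts the25, since 2 is a non-alphabetic character, which was exactly what the \b version missed:

    import re

    pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"
    for text in ["The boat", "read the book", "other", "the25"]:
        print(text, "->", bool(re.search(pattern, text)))
    # The boat -> True, read the book -> True, other -> False, the25 -> True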
RE – A Complex Example
■ Exercise: Write a regular expression that will match "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000".
■ First, the price. What about $155.55? Deal with fractions of dollars:
■ /$[0-9]+/ # whole dollars
■ /$[0-9]+\.[0-9][0-9]/ # fractions of dollars
■ This pattern allows $155.55 but not $155. Make the cents optional, and add a word boundary:
■ /$[0-9]+(\.[0-9][0-9])?/ # cents optional
■ /\b$[0-9]+(\.[0-9][0-9])?\b/ # word boundary
■ Specification for processor speed (in megahertz = MHz or gigahertz = GHz):
■ Speed: /\b[0-9]+_*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
■ /_*/ means "zero or more spaces" (the _ stands for a space)
■ Memory size: /\b[0-9]+_*(Mb|[Mm]egabytes?)\b/
■ Allow gigabyte fractions like 5.5 Gb:
■ /\b[0-9](\.[0-9]+)?_*(Gb|[Gg]igabytes?)\b/
■ Operating system and vendor:
■ Vendor: /\b(Win95|Win98|WinNT|Windows_*(NT|95|98|2000)?)\b/
■ /\b(Mac|Macintosh|Apple)\b/
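A sketch that applies these patterns in Python to a made-up ad string. The leading \b before $ is dropped here because, in Python's re, no word boundary exists between a space and $; the _ "space" markers are written as literal spaces:

    import re

    ad = "Pentium PC with 500 MHz, 5.5 Gb disk, Windows 98, for $999.99"

    price  = r"\$[0-9]+(\.[0-9][0-9])?"
    speed  = r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b"
    memory = r"\b[0-9]+(\.[0-9]+)? *(Mb|[Mm]egabytes?|Gb|[Gg]igabytes?)\b"
    vendor = r"\b(Win95|Win98|WinNT|Windows *(NT|95|98|2000)?|Mac|Macintosh|Apple)\b"

    for name, pat in [("price", price), ("speed", speed),
                      ("memory", memory), ("vendor", vendor)]:
        m = re.search(pat, ad)
        print(name, "->", m.group(0) if m else None)
    # price -> $999.99, speed -> 500 MHz, memory -> 5.5 Gb, vendor -> Windows 98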
Morphological Analysis (Morphological Parsing)
• The goal of morphological parsing is to find out what morphemes a given word is built from. For example, a morphological parser should be able to tell us that the word cats is the plural form of the noun stem cat, and that the word mice is the plural form of the noun stem mouse. So, given the string cats as input, a morphological parser should produce an output that looks similar to cat N PL.
• A stemming algorithm works by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word.
• Lemmatization is somewhat similar to stemming, as it maps several words onto one common root.
• For example, a lemmatizer should map gone, going and went to go.
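A small contrast of the two in Python with NLTK (this assumes nltk is installed and the WordNet data has been fetched with nltk.download('wordnet')):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Stemming just cuts affixes, so irregular forms are left alone.
    print([stemmer.stem(w) for w in ["cats", "going", "went", "mice"]])
    # ['cat', 'go', 'went', 'mice']

    # Lemmatization maps words to a dictionary root.
    print([lemmatizer.lemmatize(w, pos="v") for w in ["gone", "going", "went"]])
    # ['go', 'go', 'go']
    print(lemmatizer.lemmatize("mice", pos="n"))  # mouse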
POS tags
• Grammatically speaking, the type of a word is referred to as its POS tag or part of speech.
• Nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions and interjections.
• Named entity recognition is the process of detecting named entities such as person names, company names and locations; it is a phrase identification task.
• TF-IDF
Bag of Words:
• It represents a text document as a multiset of its words,
disregarding grammar and word order, but keeping the frequency
of words.
• This representation is useful for tasks such as text classification,
document similarity, and text clustering.
• To transform tokens into a set of features.
• In document classification, for example in a task of review-based sentiment analysis, the presence of words like ‘fabulous’ and ‘excellent’ indicates a positive review, while words like ‘annoying’ and ‘poor’ point to a negative review.
2. Lack of context information: The bag of words model only considers the frequency of words
in a document, disregarding grammar, word order, and context.
3. Insensitivity to word associations: The bag of words model doesn’t consider the associations
between words, and the semantic relationships between words in a document.
4. Lack of semantic information: As the bag of words model only considers individual words, it
does not capture semantic relationships or the meaning of words in context.
5. Importance of stop words: Stop words, such as “the”, “and”, “a”, etc., can have a large impact
on the bag of words representation of a document, even though they may not carry much
meaning.
6. Sparsity: For many applications, the bag of words representation of a document can be very
sparse, meaning that most entries in the resulting feature vector will be zero. This can lead
to issues with computational efficiency and difficulty in interpretability.
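Despite these limitations, the representation is easy to build. A minimal sketch with scikit-learn's CountVectorizer (an assumed choice of library; the two reviews are made up):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The movie was fabulous and excellent",
        "The plot was annoying and the acting poor",
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    print(X.toarray())  # word counts only: grammar and order are discarded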
• A free morpheme is a single meaningful unit of a word that can stand alone in the language. For
example: cat, mat, trust, slow.
• A bound morpheme cannot stand alone; it has no real meaning on its own. For example, in walked, the (ed) cannot stand alone, and in unpleasant, the (un) is not a standalone morpheme. Bound morphemes typically occur as prefixes and suffixes.
• Bound morphemes can be grouped into a further two categories:
1. Derivational 2. Inflectional
• Derivational morphemes: Look at the word able and let it become ability; in this instance the adjective becomes a noun.
• The word send, a verb morpheme, becomes sender, a noun, with the addition of er.
• While stable to unstable changes the meaning of the word to its opposite.
• In other words, the meaning of the word is completely changed by adding a derivational morpheme to a base word.
• Inflectional morphemes: Additions to the base word that do not change the word's core meaning, but rather serve as grammatical indicators. They show grammatical information. For example:
1. Laugh becomes the past tense by adding ed, changing the word to laughed.
4. All these examples show how morphology participates in the study of linguistics.
• The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
Introduction to POS Tagging
• Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text is
labeled with its corresponding part of speech.
• This can include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity recognition, and
machine translation.
• It can also be used to identify the grammatical structure of a sentence and to disambiguate words that have
multiple meanings.
• Example:
• Text: "The cat sat on the mat."
• POS tags: The: determiner, cat: noun, sat: verb, on: preposition, the: determiner, mat: noun
• POS tagging is a process of converting a sentence to forms – a list of words, and a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.

Part of Speech      Tag
Noun (Singular)     NN
Verb                VB
Determiner          DT
Adjective           JJ
Adverb              RB
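A sketch of this (word, tag) representation with NLTK's off-the-shelf tagger (assumes the punkt and averaged_perceptron_tagger resources have been downloaded):

    import nltk

    tokens = nltk.word_tokenize("The cat sat on the mat.")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
    #  ('the', 'DT'), ('mat', 'NN'), ('.', '.')]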
• Collect a dataset of annotated text: this dataset will be used to train and test the POS tagger. The text should be annotated with the correct POS tags for each word.
• Preprocess the text: this may include tasks such as tokenization (splitting the text into individual words), lowercasing, and removing punctuation.
• Divide the dataset into training and testing sets: the training set will be used to train the POS tagger, and the testing set will be used to evaluate its performance.
• Train the POS tagger: this may involve building a statistical model, such as a hidden Markov model (HMM), or defining a set of rules for a rule-based or transformation-based tagger. The model or rules will be trained on the annotated text in the training set.
• Test the POS tagger: use the trained model or rules to predict the POS tags of the words in the testing set. Compare the predicted tags to the true tags and calculate metrics such as precision and recall to evaluate the performance of the tagger.
• Fine-tune the POS tagger: if the performance of the tagger is not satisfactory, adjust the model or rules and repeat the training and testing process until the desired level of accuracy is achieved.
• Use the POS tagger: once the tagger is trained and tested, it can be used to perform POS tagging on new, unseen text. A minimal sketch of this workflow appears below.
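A compressed sketch of this workflow in Python, with NLTK's Penn Treebank sample as the annotated dataset and a simple unigram tagger standing in for the statistical model (assumes the treebank data is downloaded; .accuracy() is called .evaluate() in older NLTK versions):

    import nltk
    from nltk.corpus import treebank

    # Annotated dataset, split into training and testing sets.
    tagged_sents = treebank.tagged_sents()
    split = int(len(tagged_sents) * 0.8)
    train_set, test_set = tagged_sents[:split], tagged_sents[split:]

    # Train: a unigram tagger assigns each word its most frequent training tag,
    # backing off to NN for unseen words.
    tagger = nltk.UnigramTagger(train_set, backoff=nltk.DefaultTagger("NN"))

    # Test: tagging accuracy on the held-out sentences.
    print("accuracy:", tagger.accuracy(test_set))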
Different POS Tagging Techniques
1. Define a set of rules for assigning POS tags to words. For example:
• If the word ends in “-tion,” assign the tag “noun.”
• If the word ends in “-ment,” assign the tag “noun.”
• If the word is all uppercase, assign the tag “proper noun.”
• If the word is a verb ending in “-ing,” assign the tag “verb.”
2. Iterate through the words in the text and apply the rules to each word in turn. For example:
• “Nation” would be tagged as “noun” based on the first rule.
• “Investment” would be tagged as “noun” based on the second rule.
• “UNITED” would be tagged as “proper noun” based on the third rule.
• “Running” would be tagged as “verb” based on the fourth rule.
3. Output the POS tags for each word in the text.
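A toy Python implementation of exactly these four rules (real rule-based taggers use far larger rule sets and handle conflicts between rules):

    def rule_based_tag(word):
        if word.endswith("tion") or word.endswith("ment"):
            return "noun"
        if word.isupper():
            return "proper noun"
        if word.endswith("ing"):
            return "verb"
        return "unknown"

    for w in ["Nation", "Investment", "UNITED", "Running"]:
        print(w, "->", rule_based_tag(w))
    # Nation -> noun, Investment -> noun, UNITED -> proper noun, Running -> verb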
Another technique of tagging is stochastic POS tagging. Now, the question that arises here is which model can be called stochastic.
A model that incorporates frequency or probability (statistics) can be called stochastic.
Any of a number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging.
The simplest stochastic taggers apply the following approaches to POS tagging −
Word Frequency Approach
In this approach, the stochastic tagger disambiguates words based on the probability that a word occurs with a particular tag. We can also say that the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. The main issue with this approach is that it may yield inadmissible sequences of tags.
Tag Sequence Probabilities
This is another approach to stochastic tagging, where the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, because the best tag for a given word is determined by the probability with which it occurs with the n previous tags.
• This is in contrast to rule-based POS tagging, which assigns tags to words based
on pre-defined rules, and to statistical POS tagging, which relies on a trained
model to predict tags based on probability.
• Ambiguity: Some words can have multiple POS tags depending on the context in which they appear,
making it difficult to determine their correct tag. For example, the word “bass” can be a noun (a type of
fish) or an adjective (having a low frequency or pitch).
• Out-of-vocabulary (OOV) words: Words that are not present in the training data of a POS tagger can
be difficult to tag accurately, especially if they are rare or specific to a particular domain.
• Complex grammatical structures: Languages with complex grammatical structures, such as languages
with many inflections or free word order, can be more challenging to tag accurately.
• Lack of annotated training data: Some languages or domains may have limited annotated training
data, making it difficult to train a high-performing POS tagger.
• Inconsistencies in annotated data: Annotated data can sometimes contain errors or inconsistencies,
which can negatively impact the performance of a POS tagger.
• Word2Vec: Word2Vec is a popular word embedding technique that learns continuous word
representations from large amounts of text data. It offers two algorithms: Continuous Bag of Words
(CBOW) and Skip-gram. These models generate dense word vectors that capture semantic
similarities between words based on their context.
• GloVe (Global Vectors for Word Representation): GloVe is another widely used method for
learning word embeddings. It combines the global co-occurrence statistics of words in a corpus to
create word vectors. GloVe embeddings capture both semantic and syntactic relationships between
words.
• FastText: FastText is an extension of Word2Vec that represents each word as a bag of character n-grams. It can generate word embeddings for out-of-vocabulary words based on their character-level information, making it useful for handling misspellings and rare words.
• Let's demonstrate a simple example of word embeddings using Word2Vec, one of the popular
techniques for learning word representations. For this example, we will use a small dataset of movie
reviews and create word embeddings using the Word2Vec algorithm.
• Step 1: Preprocess the data. Suppose we have the following movie reviews:
1. "The movie was fantastic, with amazing special effects."
2. "The plot was engaging and kept me hooked till the end."
3. "The acting was superb, especially by the lead actor."
4. "The film had stunning visuals and great cinematography."
• We need to preprocess the data by tokenizing the sentences and converting the text to lowercase.
• Step 2: Train the Word2Vec model. Next, we train a Word2Vec model using the tokenized reviews.
• Step 3: Retrieve word embeddings. Now, we can access the word embeddings for specific words using the trained Word2Vec model.
• Step 4: Similar words. We can also find words similar to a given word based on their embeddings.
• Step 5: Word similarity. Additionally, we can measure the similarity between two words.
Word2Vec Example
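A sketch of Steps 1-5 with the gensim library (the hyperparameters vector_size=50, window=3 and epochs=200 are illustrative choices for this tiny corpus):

    from gensim.models import Word2Vec

    reviews = [
        "The movie was fantastic, with amazing special effects.",
        "The plot was engaging and kept me hooked till the end.",
        "The acting was superb, especially by the lead actor.",
        "The film had stunning visuals and great cinematography.",
    ]

    # Step 1: preprocess -- lowercase and tokenize (a simple split here).
    tokenized = [r.lower().replace(",", "").replace(".", "").split()
                 for r in reviews]

    # Step 2: train the Word2Vec model on the tokenized reviews.
    model = Word2Vec(sentences=tokenized, vector_size=50, window=3,
                     min_count=1, epochs=200)

    # Step 3: retrieve the embedding for a specific word.
    print(model.wv["fantastic"].shape)               # (50,)

    # Step 4: find words similar to a given word.
    print(model.wv.most_similar("fantastic", topn=3))

    # Step 5: measure the similarity between two words.
    print(model.wv.similarity("plot", "visuals"))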
• The resulting word embeddings and similarity scores will depend on the specific
corpus and the number of training iterations, but they should capture the semantic
relationships between words based on their context in the reviews.
• For instance, "fantastic" and "amazing" are likely to have a high similarity score, as
they both frequently appear together in positive contexts in the dataset. Similarly,
"plot" and "visuals" might also have a reasonable similarity score if they co-occur
in sentences discussing movie elements.
• Language modeling is a fundamental task in Natural Language Processing (NLP) that involves building a
statistical model to predict the probability distribution of words in a given language.
• The language model learns the patterns and relationships between words in a corpus of text and can be used to generate new text, evaluate the likelihood of a sentence, and perform speech recognition or machine translation.
• In language modeling, the primary goal is to estimate the probability of a sequence of words (a sentence or
a phrase) using the conditional probability of each word given its preceding context.
• The model learns from large amounts of text data to predict the likelihood of a particular word given the
previous words in a sentence.
There are different types of language models, but two prominent approaches are
• N-gram Language Models
• Neural Language Models
N-gram Language Models: N-gram language models are simple and widely used in early NLP
tasks. An N-gram model predicts the probability of a word based on the previous (N-1) words in a
sentence.
For example, a trigram model (3-gram) predicts the probability of a word given the two preceding
words. The model estimates the probabilities based on the frequency of word sequences observed
in the training data.
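A tiny bigram model estimated by counting, matching the description above (the corpus is a toy example):

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    # P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})
    def bigram_prob(prev, word):
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_prob("the", "cat"))  # 2/3: 'the' occurs 3 times, 'the cat' twice
    print(bigram_prob("cat", "sat"))  # 1/2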
Neural Language Models: Neural language models, such as recurrent neural networks (RNNs)
and transformer-based models, have gained significant popularity in recent years due to their
ability to capture long-range dependencies and contextual information
These models learn complex patterns in the language and can generate more coherent and
contextually relevant text.
• Recurrent Neural Networks (RNNs): RNNs are a class of neural networks designed for sequential
data processing. They process input sequences step by step, maintaining a hidden state that captures
information from previous steps.
• This hidden state acts as the context for the current word prediction. However, RNNs have
challenges with capturing long-range dependencies and can suffer from vanishing or exploding
gradients.
• Transformer-based Models: Transformer-based models, like the famous BERT (Bidirectional
Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) series,
have revolutionized language modeling.
Reference Links
1) Course slides by Dan Jurafsky and Christopher Manning:
• https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html
2) This is the e-book which can be followed:
• https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
3) This is the channel by Dan Jurafsky and Manning where they teach each topic from zero level:
• https://www.youtube.com/watch?v=808M7q8QX0E&list=PLaZQkZp6WhWyvdiP49JG-rjyTPck_hvEu
4) https://www.shiksha.com/online-courses/articles/pos-tagging-in-nlp/
5) https://web.stanford.edu/~jurafsky/slp3/