NLP Notes Unit-1
Phonology – This science deals with the patterns of sounds in speech, treating sound as a
physical entity.
Pragmatics – This science studies the different uses of language.
Morphology – This science deals with the structure of the words and the systematic relations
between them.
Syntax – This science deals with the structure of sentences.
Semantics – This science deals with the literal meaning of the words, phrases as well as
sentences.
APPLICATIONS OF NLP
Search Autocorrect and Autocomplete: Whenever you search for something on Google, after
typing two or three letters, it suggests possible search terms.
Language Translator: Have you ever used Google Translate to find out what a particular
word or phrase is in a different language? I’m sure it’s a YES!!
Social Media Monitoring: More and more people these days have started using social media
for posting their thoughts about a particular product, policy, or matter.
Chatbots: Customer service and experience are among the most important things for any company.
Survey Analysis: Surveys are an important way of evaluating a company’s performance.
Targeted Advertising: One day I was searching for a mobile phone on Amazon, and a few
minutes later, Google started showing me ads related to similar mobile phones.
Hiring and Recruitment: The Human Resources department has the important job of
selecting the right employees for a company.
Voice Assistants: Google Assistant, Apple Siri, Amazon Alexa
Grammar Checkers: Grammar checkers are among the most widely used applications of natural language processing.
Email Filtering: Have you ever used Gmail?
Sentiment Analysis: Natural language understanding is particularly difficult for machines
when it comes to opinions, given that humans often use sarcasm and irony.
Text Classification: Text classification, a text analysis task that also includes sentiment
analysis, involves automatically understanding, processing, and categorizing unstructured
text.
Text Extraction: Text extraction, or information extraction, automatically detects specific
information in a text, such as names, companies, places, and more. This is also known
as named entity recognition.
Machine Translation: Machine translation (MT) is one of the first applications of natural
language processing.
Text Summarization: There are two ways of using natural language processing to
summarize data: extraction-based summarization ‒ which extracts keyphrases and creates a
summary, without adding any extra information ‒ and abstraction-based
summarization, which creates new phrases paraphrasing the original source.
Text-based applications: Text-based applications involve the processing of written text, such
as books, newspapers, reports, manuals, e-mail messages, and so on. These are all reading-
based tasks. Text-based natural language research is ongoing in applications such as
o finding appropriate documents on certain topics from a database of texts (for example,
finding relevant books in a library)
o extracting information from messages or articles on certain topics (for example, building
a database of all stock transactions described in the news on a given day)
o translating documents from one language to another (for example, producing automobile
repair manuals in many different languages)
o summarizing texts for certain purposes (for example, producing a 3-page summary of a
1000-page government report)
Dialogue-based applications involve human-machine communication. Most naturally this
involves spoken language, but it also includes interaction using keyboards. Typical potential
applications include
o question-answering systems, where natural language is used to query a database (for example,
a query system to a personnel database)
o automated customer service over the telephone (for example, to perform banking transactions
or order items from a catalogue)
o tutoring systems, where the machine interacts with a student (for example, an automated
mathematics tutoring system)
o spoken language control of a machine (for example, voice control of a VCR or computer)
o general cooperative problem-solving systems (for example, a system that helps a person plan
and schedule freight shipments)
LEVELS OF NLP
The levels of linguistic analysis introduced above, from sound to use in context, are: phonology,
morphology, syntax, semantics, and pragmatics.
REGULAR EXPRESSIONS
Regex, or regular expressions: a language for specifying text search strings.
Regular expressions are used in practically every programming language, in text processing tools like
the Unix tool grep, and in editors like vim or Emacs.
A regular expression is an algebraic notation for characterizing a set of strings.
Regular expressions are particularly useful for searching in texts.
Disjunctions
Letters inside square brackets [] match any single character from the set.
Pattern          Matches
[wW]oodchuck     Woodchuck, woodchuck
[1234567890]     Any one digit
Convenient aliases
Pattern   Expansion      Matches                   Example
\d        [0-9]          Any digit                 Fahrenheit 451
\D        [^0-9]         Any non-digit             Blue Moon
\w        [a-zA-Z0-9_]   Any alphanumeric or _     Daiyu
\W        [^\w]          Not alphanumeric or _     Look!
\s        [ \r\t\n\f]    Whitespace (space, tab)   Look␣up
\S        [^\s]          Not whitespace            Look up
More Disjunction: The pipe symbol | for disjunction
Pattern Matches
groundhog|woodchuck woodchuck
yours|mine Yours
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck Woodchuck
Anchors ^ $
Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
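The patterns above can be tried directly with Python's re module; this is a minimal sketch,
and the test strings are simply the examples from the tables.
```python
import re

print(re.findall(r"[wW]oodchuck", "Woodchuck and woodchuck"))  # ['Woodchuck', 'woodchuck']
print(re.findall(r"\d", "Fahrenheit 451"))                     # ['4', '5', '1']
print(re.findall(r"groundhog|woodchuck", "woodchuck"))         # ['woodchuck']
print(re.findall(r"^[A-Z]", "Palo Alto"))                      # ['P'] (anchored at line start)
print(re.findall(r"\.$", "The end."))                          # ['.'] (anchored at line end)
```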
Words: e.g., they lay back on the San Francisco grass and looked at the stars and their
Type: an element of the vocabulary.
Token: an instance of that type in running text.
How many?
◦ 15 tokens (or 14)
◦ 13 types (or 12) (or 11?)
N = number of tokens
V = vocabulary = set of types, |V| is size of vocabulary
Heaps' Law (also known as Herdan's Law): |V| = kN^β, where k is a constant and often 0.67 < β < 0.75,
i.e., vocabulary size grows with more than the square root of the number of word tokens.
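A minimal sketch of counting tokens and types for the example sentence above; splitting on
whitespace is an assumption, and alternative choices (such as treating "San Francisco" as a single
name) give the other counts mentioned.
```python
sentence = "they lay back on the San Francisco grass and looked at the stars and their"
tokens = sentence.split()   # token = each running instance
types = set(tokens)         # type = each distinct vocabulary item

print("N   =", len(tokens))  # 15 tokens
print("|V| =", len(types))   # 13 types ("the" and "and" each appear twice)
```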
Edit Distance
• The minimum edit distance between two strings is the minimum number of editing operations
needed to transform one into the other:
• Insertion
• Deletion
• Substitution
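A minimal dynamic-programming sketch of minimum edit distance, assuming unit cost for
insertion, deletion, and substitution (the function name is illustrative).
```python
def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all characters of source
    for j in range(m + 1):
        dp[0][j] = j          # insert all characters of target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[n][m]

print(min_edit_distance("intention", "execution"))   # 5 with unit substitution cost
```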
METHODS OF MORPHOLOGY
• Morpheme Based Morphology: Words are analysed as arrangements of morphemes.
• Lexeme Based Morphology: Lexeme-based morphology usually takes what is called an "item-
and-process" approach. Instead of analysing a word form as a set of morphemes arranged in
sequence, a word form is said to be the result of applying rules that alter a word form or stem in
order to produce a new one.
• Word Based Morphology: Word-based morphology is (usually) a word-and-paradigm approach.
Instead of stating rules to combine morphemes into word forms or to generate word forms from
stems, word-based morphology states generalizations that hold between the forms of inflectional
paradigms.
INFLECTIONAL, DERIVATIONAL
Inflectional morphology produces grammatical forms of the same lexeme (e.g., cat → cats,
walk → walked), whereas derivational morphology creates new words, often with a different part
of speech (e.g., happy → happiness).
CLITICIZATION
In morphosyntax, cliticization is a process by which a complex word is formed by attaching a clitic to
a fully inflected word.
Clitic: a morpheme that acts like a word but is reduced and attached to another word.
I've, l'opera
MORPHOLOGICAL ANALYSIS
Morphological analysis:
token → lemma + part of speech + grammatical features
Examples:
cats → cat+N+plur
played → play+V+past
katternas → katt+N+plur+def+gen
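A minimal sketch of automatic morphological analysis using spaCy; it assumes the en_core_web_sm
English model is installed (the Swedish example above would need a Swedish model instead).
```python
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("cats played"):
    # lemma, coarse part of speech, and grammatical features for each token,
    # e.g. cats -> cat / NOUN / Number=Plur
    print(token.text, token.lemma_, token.pos_, token.morph)
```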
TOKENIZATION
Tokenization is the process of dividing a text into smaller units known as tokens. Tokens are
typically words or sub-words in the context of natural language processing.
Types of Tokenization
1. Word Tokenization: Word tokenization divides the text into individual words. Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
2. Sentence Tokenization: The text is segmented into sentences during sentence tokenization.
Example:
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller
units."]
3. Subword Tokenization: Subword tokenization entails breaking down words into smaller
units, which can be especially useful when dealing with morphologically rich languages or
rare words.
Example:
Input: "tokenization"
Output: ["token", "ization"]
4. Character Tokenization: This process divides the text into individual characters.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
Limitations of tokenization
• May struggle with ambiguity: handling language ambiguities during tokenization is difficult.
• Can be resource intensive: defining specialized tokenization rules can be time-consuming.
• Might have token loss: some tokenization processes lose information.
• May struggle with punctuation: apostrophes or dashes can sometimes be tricky to handle.
STEMMING
Stemming is the process of reducing a word to its word stem by stripping affixes (prefixes and
suffixes), so that related word forms map to a common root.
Types of Stemmer
1. Porter Stemmer: Martin Porter invented the Porter Stemmer or Porter algorithm in 1980. Five
steps of word reduction are used in the method, each with its own set of mapping rules. Porter
Stemmer is the original stemmer and is renowned for its ease of use and rapidity. Frequently, the
resultant stem is a shorter word with the same root meaning.
Step 1a
SSES → SS
IES → I
SS → SS
S →
Step 1b
(m>0) EED → EE
(*v*) ED →
(*v*) ING →
If the second or third of the rules in Step 1b is successful, the following is performed.
AT → ATE
BL → BLE
IZ → IZE
(*d and not (*L or *S or *Z)) → single letter
(m=1 and *o) → E
Step 1c
(*v*) Y → I
Step 2
(m>0) ATIONAL → ATE
(m>0) TIONAL → TION
(m>0) ENCI → ENCE
(m>0) ANCI → ANCE
(m>0) IZER → IZE
(m>0) ABLI → ABLE
(m>0) ALLI → AL
(m>0) ENTLI → ENT
(m>0) ELI → E
(m>0) OUSLI → OUS
(m>0) IZATION → IZE
(m>0) ATION → ATE
(m>0) ATOR → ATE
(m>0) ALISM → AL
(m>0) IVENESS → IVE
(m>0) FULNESS → FUL
(m>0) OUSNESS → OUS
(m>0) ALITI → AL
(m>0) IVITI → IVE
(m>0) BILITI → BLE
Step 3
(m>0) ICATE → IC
(m>0) ATIVE →
(m>0) ALIZE → AL
(m>0) ICITI → IC
(m>0) ICAL → IC
(m>0) FUL →
(m>0) NESS →
Step 4
(m>1) AL →
(m>1) ANCE →
(m>1) ENCE →
(m>1) ER →
(m>1) IC →
(m>1) ABLE →
(m>1) IBLE →
(m>1) ANT →
(m>1) EMENT →
(m>1) MENT →
(m>1) ENT →
(m>1 and (*S or *T)) ION →
(m>1) OU →
(m>1) ISM →
(m>1) ATE →
(m>1) ITI →
(m>1) OUS →
(m>1) IVE →
(m>1) IZE →
Step 5a
(m>1) E →
(m=1 and not *o) E →
Step 5b
(m > 1 and *d and *L) → single letter
Example:
Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect
2. Snowball Stemmer: Martin Porter also created Snowball Stemmer. The method utilized in this
instance is more precise and is referred to as “English Stemmer” or “Porter2 Stemmer.” It is
somewhat faster and more logical than the original Porter Stemmer.
A few rules:
ILY -----> ILI
LY -----> Nil
SS -----> SS
S -----> Nil
ED -----> E,Nil
Example:
generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat
3. Lancaster Stemmer: Lancaster Stemmer is straightforward, although it often produces results
with excessive stemming. Over-stemming renders stems non-linguistic or meaningless. Applies rules
to strip suffixes such as "ing" or "ed."
eating ---> eat
eats ---> eat
eaten ---> eat
puts ---> put
putting ---> put
4. Regexp Stemmer: Regex stemmer identifies morphological affixes using regular expressions.
Substrings matching the regular expressions will be discarded.
mass ---> mas
was ---> was
bee ---> bee
computer ---> computer
advisable ---> advis
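A minimal sketch comparing the four stemmers above using NLTK; the RegexpStemmer pattern
and minimum length below are illustrative choices, not fixed parts of the API.
```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
regexp = RegexpStemmer("ing$|s$|e$|able$", min=4)   # strip these suffixes from words of length >= 4

print(porter.stem("connections"))    # connect
print(snowball.stem("generously"))   # generous
print(lancaster.stem("eating"))      # eat
print(regexp.stem("advisable"))      # advis
```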
LEMMATIZATION
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected
forms of a word so they can be analysed as a single item, identified by the word's lemma, or
dictionary form. The association of the base form with a part of speech is often called a lexeme of the
word.
For example: the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form,
'walk', that one might look up in a dictionary, is called the lemma for the word.
For instance:
The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a
dictionary look-up.
The word "walk" is the base form for the word "walking", and hence this is matched in both
stemming and lemmatisation.
The word "meeting" can be either the base form of a noun or a form of a verb ("to meet")
depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow".
Unlike stemming, lemmatisation attempts to select the correct lemma depending on the
context.
Lemmatization Techniques
1. Rule Based Lemmatization: Rule-based lemmatization involves the application of predefined
rules to derive the base or root form of a word.
Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
Example: “walked” --> "walk"
2. Dictionary-Based Lemmatization: Dictionary-based lemmatization relies on predefined
dictionaries or lookup tables to map words to their corresponding base forms or lemmas.
‘running’ -> ‘run’
‘better’ -> ‘good’
‘went’ -> ‘go’
3. Machine Learning-Based Lemmatization: Machine learning-based lemmatization leverages
computational models to automatically learn the relationships between words and their base forms.
Example: When encountering the word ‘went,’ the model, having learned patterns, predicts the base
form as ‘go.’
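A minimal sketch of dictionary-based lemmatization with NLTK's WordNetLemmatizer; it
assumes the WordNet data has been downloaded (nltk.download('wordnet')), and the
part-of-speech hints are supplied by hand here.
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # run
print(lemmatizer.lemmatize("better", pos="a"))    # good (a dictionary look-up stemming would miss)
print(lemmatizer.lemmatize("meeting", pos="n"))   # meeting (noun reading kept)
print(lemmatizer.lemmatize("meeting", pos="v"))   # meet
```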
FEATURE EXTRACTION
• Feature extraction is a process in machine learning and data analysis that involves identifying
and extracting relevant features from raw data.
• NLP feature extraction methods are techniques used to convert raw text data into
numerical representations that can be processed by machine learning models.
• These methods aim to capture the meaningful information and patterns in text data.
1. Label Encoding
2. One Hot Encoding
3. Count Vectorization
   • TF-IDF Vectorizer
   • Bag of Words (BOW)
4. Word Embedding
   • Word2Vec
   • GloVe
   • FastText
5. N-gram Features
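As a quick illustration of count vectorization (the Bag of Words model listed above), here is a
minimal sketch using scikit-learn's CountVectorizer; the two example sentences are illustrative.
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really like your watch", "I really appreciate your help"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(bow.toarray())                        # per-document word counts
```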
TF-IDF stands for term frequency-inverse document frequency. The TF-IDF value increases
proportionally to the number of times a word appears in the document and decreases with the
number of documents in the corpus that contain the word. It is composed of two sub-parts:
1. Term Frequency (TF)
2. Inverse Document Frequency (IDF)
TF-IDF is used for:
1. Text retrieval and information retrieval systems
2. Document classification and text categorization
3. Text summarization
4. Feature extraction for text data in machine learning algorithms.
Term Frequency (TF): TF measures how frequently a term t occurs in a document d; it is the
count of t in d, often normalized by the total number of terms in the document.
Inverse Document Frequency (IDF): IDF measures the rarity of a term across a collection of
documents. It is calculated as the logarithm of the ratio of the total number of documents to the
number of documents containing the term. The goal is to penalize words that are common across
all documents.
Document Frequency (DF): DF is similar to TF but is computed over the whole corpus: whereas
TF counts the occurrences of a term t within a single document d, DF counts the number of
documents in the collection N that contain t.
df(t) = number of documents containing the term t
Combining TF and IDF: The TF-IDF score for a term in a document is obtained by multiplying
its TF and IDF scores.
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)
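A minimal sketch of computing TF-IDF vectors with scikit-learn's TfidfVectorizer; note that
scikit-learn uses a smoothed variant of the IDF formula described above, and the example
documents are illustrative.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs do not get along"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # one row per document, one column per term

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))               # TF-IDF weight of each term in each document
```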
PART-OF-SPEECH (POS) TAGGING
The input is a sequence x1, x2, …, xn of (tokenized) words and a tagset, and the output is a sequence
y1, y2, …, yn of tags, each output yi corresponding to exactly one input xi.
Figure: The task of part-of-speech tagging: mapping from input words x1, x2, …, xn to output
POS tags y1, y2, …, yn.
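A minimal sketch of POS tagging with NLTK's off-the-shelf tagger; it assumes the punkt and
averaged_perceptron_tagger data packages have been downloaded, and the example sentence is
taken from the word/token example earlier.
```python
import nltk

tokens = nltk.word_tokenize("They lay back on the San Francisco grass")
print(nltk.pos_tag(tokens))   # a list of (word, tag) pairs, one Penn Treebank tag per input token
```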
NAMED ENTITY RECOGNITION (NER)
Ambiguity in NER
For a person, the category definition is intuitively quite clear, but for computers there is
some ambiguity in classification. Let's look at some ambiguous examples:
England (Organisation) won the 2019 world cup vs The 2019 world cup happened
in England (Location).
Washington (Location) is the capital of the US vs The first president of the US
was Washington (Person).
Methods of NER
One way is to train the model for multi-class classification using different machine
learning algorithms, but it requires a lot of labelling.
In addition to labelling the model also requires a deep understanding of context to deal
with the ambiguity of the sentences. This makes it a challenging task for simple machine
learning /
Another way is that Conditional random field that is implemented by both NLP Speech
Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such as
words. The CRF can capture a deep understanding of the context of the sentence. In this model,
the input
Deep Learning Based NER: Deep learning NER is much more accurate than the previous
methods, as it is capable of assembling words. This is because it uses word embeddings, which
can capture the semantic and syntactic relationships between various words.
Rule Based Named Entity Recognition: Example rules:
1. If words are repeated, return them as the name.
2. If words are between quotation marks, return them as the name.
3. Return words before company suffixes such as "Inc" as the name.
4. Return the entire memo as the name if the word count in the memo is fewer than 3. If none
of the extraction patterns is successful, the program outputs nothing and lets the user
know that it has failed to extract a vendor name from the memo.
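A minimal sketch of named entity recognition with spaCy's pretrained pipeline (an example of the
deep-learning-based approach above); it assumes the en_core_web_sm model has been installed
(python -m spacy download en_core_web_sm).
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Washington is the capital of the US")
for ent in doc.ents:
    print(ent.text, ent.label_)   # each detected entity span with its predicted label, e.g. GPE
```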
N-GRAMS
An N-gram means a sequence of N words.
For example,
“Medium blog” is a 2-gram (a bigram)
“A Medium blog post” is a 4-gram,
“Write on Medium” is a 3-gram (trigram).
Consider the following sentences as the training corpus:
Thank you so much for your help.
I really appreciate your help.
Excuse me, do you know what time it is?
I’m really sorry for not inviting you.
I really like your watch.
Suppose we're calculating the probability of the word "w1" occurring after the word "w2"; the
formula for this is as follows:
count(w2 w1) / count(w2)
i.e., the number of times the two words occur in the required sequence, divided by the number of
times the preceding word occurs in the corpus.
From our example sentences, let’s calculate the probability of the word “like” occurring after the
word “really”:
count(really like) / count(really)
=1/3
= 0.33
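A minimal sketch that reproduces this bigram estimate by counting over the five training
sentences; lower-casing, stripping punctuation, and ignoring sentence boundaries are
simplifying assumptions.
```python
corpus = [
    "Thank you so much for your help.",
    "I really appreciate your help.",
    "Excuse me, do you know what time it is?",
    "I'm really sorry for not inviting you.",
    "I really like your watch.",
]

tokens = [w.lower().strip(".,?") for s in corpus for w in s.split()]
bigrams = list(zip(tokens, tokens[1:]))

# count(really like) / count(really)
print(bigrams.count(("really", "like")) / tokens.count("really"))   # 1/3 ≈ 0.33
```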
Now suppose we know the probability of the first word being 'Natural'. What is the probability
of the next word being 'Language', given the previous word 'Natural'? And after generating the
words 'Natural Language', what is the probability of the next word being 'Processing'?
Probability:
P(w|h) is the probability of a word w given some history h. Suppose the history h is "its water is so
transparent that" and we want to know the probability that the next word is "the".
One way to estimate this probability is from relative frequency counts: take a very large corpus, count
the number of times we see "its water is so transparent that", and count the number of times this is
followed by "the". This answers the question "Out of the times we saw the history h, how many
times was it followed by the word w":
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
More generally, we want to estimate the probability of a word w given a history h, or the probability
of an entire word sequence.
Applications of n‐grams:
spelling error detection and correction, query expansion, information retrieval with serial,
inverted and signature files, dictionary look‐up, text compression, and language identification.
VARIETY OF SMOOTHING
To keep the model from assigning zero probability to unseen words or n-grams, we have to shave off a
bit of probability mass from some more frequent words/n-grams and give it to words/n-grams that
have never been seen.
Smoothing is the task of re-evaluating some of the zero probability and low probability estimations.
1. Laplace (add-one) smoothing
2. Add-k smoothing
3. Good Turing smoothing
4. Stupid backoff and Interpolation
5. Kneser-Ney smoothing.
Although add-k is useful for some tasks (including text classification), it turns out that it still doesn’t
work well for language modeling, generating counts with poor variances and often inappropriate
discounts.
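A minimal sketch of Laplace (add-one) smoothing for bigram probabilities; the counts and
vocabulary size used here are illustrative, not taken from a real corpus.
```python
def laplace_bigram_prob(bigram_count, prev_word_count, vocab_size):
    # Add 1 to every bigram count and V to the denominator, so unseen bigrams
    # receive a small non-zero probability: P = (C(w1 w2) + 1) / (C(w1) + V)
    return (bigram_count + 1) / (prev_word_count + vocab_size)

V = 1446                                  # assumed vocabulary size
print(laplace_bigram_prob(0, 9, V))       # an unseen bigram no longer has probability 0
print(laplace_bigram_prob(5, 9, V))       # a seen bigram is discounted from the unsmoothed 5/9
```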
5. Kneser-Ney smoothing
Redistributes probabilities to less frequent n-grams while maintaining the distribution of high-frequency n-
grams.