NLP Module 1
Natural Language Processing – CSE3015
▪ Preprocessing techniques
▪ Tokenization, stemming, lemmatization, stop word removal, rare word removal, spell correction.
Basics of NLP
History of NLP
▪ ELIZA was an early natural language processing system capable of carrying on a limited form of conversation with a user.
▪ Mid 1950's – Mid 1960's: Birth of NLP and Linguistics
▪ At first, people thought NLP is easy! Researchers predicted that "machine translation" can be solved in 3 years or so.
▪ Mostly hand-coded rules / linguistic-oriented approaches.
▪ The 3-year project continued for 10 years, but still no good result, despite the significant amount of expenditure.
▪ Mid 1960's – Mid 1970's: A Dark Era
▪ After the initial hype, a dark era follows. People started believing that machine translation is impossible, and most abandoned research for NLP.
▪ 1970's and early 1980's: Slow Revival of NLP
▪ Some research activities revived, but the emphasis is still linguistically oriented, working on small toy problems with weak empirical evaluation.
▪ Late 1980's and 1990's: Statistical Revolution
▪ The computing power increased substantially. Data-driven statistical approaches with simple representation win over complex hand-coded linguistic rules.
▪ 2000's – 2010's: Statistics powered by Linguistic Insights
▪ With more sophistication with the statistical models, richer linguistic representation starts finding a new value.
▪ Emergence of embedding models and deep neural networks
▪ Several models: Word2Vec, GloVe, fastText, ELMo, BERT, ColBERT, GPT [1-3.5]
▪ New techniques brought attention to more complex tasks.
Challenges in NLP
Challenges contd..
▪ Ambiguity - sentences and phrases that potentially have two or more possible interpretations.
▪ Lexical
▪ The ambiguity of a single word is called lexical ambiguity: a word that could be used as a verb, noun, or adjective.
▪ Ex: "bat" (noun or verb?); "I made it" ("made" → created or cooked?)
▪ Can be resolved by part-of-speech tagging and word-sense disambiguation
▪ Semantic
▪ This kind of ambiguity occurs when the meaning of the words themselves can be misinterpreted
▪ "The car hit the pole while it was moving."
▪ It -> Car or pole? Ambiguity in entities.
▪ Can be solved by Probabilistic parsing
▪ Syntactic
▪ when a sentence is parsed in different ways
▪ "The man saw the girl with the telescope" (did the man use the telescope, or was the girl holding it?)
▪ Anaphoric
▪ ambiguity arising from the use of anaphoric entities (such as pronouns) in discourse
▪ “the horse ran up the hill. It was very steep. It soon got tired”
▪ Pragmatic
▪ knowledge of the relationship of meaning to the goals and intentions of the speaker
▪ situation where the context of a phrase gives it multiple interpretations
▪ arises when the statement is not specific. Ex: "I like you too"
Challenges - Ambiguity
▪ Include your children when baking cookies
▪ Local High School Dropouts Cut in Half
▪ Hospitals are Sued by 7 Foot Doctors
▪ Iraqi Head Seeks Arms
▪ Safety Experts Say School Bus Passengers Should Be Belted
▪ Teacher Strikes Idle Kids
Challenges - Ambiguity
▪ Pronoun Reference Ambiguity
Challenges contd..
▪ Errors in text or speech
▪ Misspelled or misused words can create problems for text analysis.
▪ Domain-specific language
▪ Different businesses and industries often use very different language.
▪ Low-resource languages
▪ Many languages, especially those spoken by people with less access to technology, often go overlooked and under-processed.
Applications of NLP
▪ Sentiment analysis (e.g., Twitter analysis)
▪ Question answering
▪ Spam detection
▪ Digital personal assistants (Google Home, Alexa)
▪ Spelling correction
▪ Chatbots and dialog systems
▪ Machine translation
▪ Text classification
▪ Information extraction – unstructured text to database entries
▪ Language comprehension
Introduction to NLTK
▪ Toolkit required: NLTK
▪ Programming language: Python
▪ Installing: pip install nltk
▪ A variety of tasks can be performed using NLTK: tokenization, stemming, lemmatization
▪ Packages: nltk.classify, nltk.cluster, nltk.parse, nltk.stem
Text wrangling
Text wrangling is basically the pre-processing work that is done to prepare raw text data for training.
Simply put, it is the process of cleaning your data to make it readable by your program, and then formatting it as such.
It includes:
• Tokenization
• Stop word removal
• Stemming
• Lemmatization
• Rare word removal
• Spell correction
Tokenization
▪ Breaking the raw text into small chunks, called tokens.
▪ These tokens help in understanding the context and in developing the model for the NLP task.
▪ 2 Types:
▪ Sentence –level
▪ from nltk.tokenize import sent_tokenize
▪ Word – level
▪ from nltk.tokenize import word_tokenize
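A minimal sketch combining both tokenizers (it assumes the punkt model has been downloaded; the sample text is adapted from the stop-word slide):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')   # tokenizer model used by both functions

text = "Nick likes to play football. However, he is not too fond of tennis."
print(sent_tokenize(text))   # sentence-level: ['Nick likes to play football.', 'However, he is not too fond of tennis.']
print(word_tokenize(text))   # word-level: ['Nick', 'likes', 'to', 'play', 'football', '.', ...]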
Stop word removal
▪ The words which are generally filtered out before processing a natural language are called stop words.
▪ These are actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
▪ By removing commonly used words that do not contribute much to the context, search systems are able to process data more quickly and accurately. Removing stop words eliminates low-information words from the text, allowing NLP algorithms to focus on the words that are more significant and provide context.
▪ Examples of a few stop words in English are "the", "a", "an", "so", "what".
▪ text = "Nick likes to play football, however he is not too fond of tennis."
▪ text_tokens = word_tokenize(text)
▪ print(tokens_without_sw)
Stemming
▪ Stemming essentially strips affixes from words, leaving only the base form.
▪ Issues:
▪ Over-stemming (two semantically distinct words are reduced to the same root, and so conflated). Ex: Wander -> Wand
▪ Under-stemming (two semantically related words are not reduced to the same root). Ex: Knavish -> Knavish and Knave -> Knave (both meaning dishonest)
▪ Types:
▪ Lovins Stemmer, Porter Stemmer, Snowball Stemmer, Lancaster Stemmer, Regexp Stemmer
Implementation
▪ from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
▪ porter = PorterStemmer()
▪ lancaster = LancasterStemmer()
▪ snowball = SnowballStemmer(language='english')
▪ regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
▪ word_list = ["friend", "friendship", "friends", "friendships"]
▪ print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word", "Porter Stemmer", "Snowball Stemmer", "Lancaster Stemmer", "Regexp Stemmer"))
▪ for word in word_list:
▪     print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word, porter.stem(word), snowball.stem(word), lancaster.stem(word), regexp.stem(word)))
Problems in stemming
Lemmatization
▪ Lemmatization takes a word and breaks it down to its lemma.
▪ For example, the verb "walk" might appear as "walking," "walks" or "walked." Inflectional endings such as "s," "ed" and "ing" are removed. Lemmatization groups these words under the lemma "walk."
▪ The word "saw" might be interpreted differently, depending on the sentence. For example, "saw" can be broken down into the lemma "see" or "saw."
▪ In these cases, lemmatization attempts to select the right lemma depending on the context of the word, the surrounding words and the sentence.
▪ Other words, such as "better," might be broken down to a lemma such as "good."
▪ Search engine algorithms use lemmatization, so the user can query any inflectional form of a word and get relevant results. For example, if the user queries the plural form of a word such as "routers," the search engine knows to also return relevant content that uses the singular form of the same word, "router."
▪ Applications: artificial intelligence (AI), big data analytics, chatbots, machine learning (ML), NLP, search queries, sentiment analysis.
▪ Stemming, by contrast, operates without any contextual knowledge, meaning that it can't discern between similar words with different meanings; it is less complex and faster than lemmatization.
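A minimal NLTK sketch of context-dependent lemmatization (it assumes the WordNet corpus has been downloaded); the pos argument supplies the context described above:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("walking", pos="v"))   # walk
print(lemmatizer.lemmatize("better", pos="a"))    # good
print(lemmatizer.lemmatize("saw", pos="v"))       # see
print(lemmatizer.lemmatize("saw", pos="n"))       # saw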
Rare word removal
▪ Sometimes we need to remove the words that are very unique in nature, like names, brands, product names, and some noise characters, such as HTML leftovers.
▪ from nltk import FreqDist
▪ tokens = ['hi','i','am','am','whatever','this','is','just','a','test','test','java','python','java']
▪ freq_dist = FreqDist(tokens)
▪ sorted_tokens = dict(sorted(freq_dist.items(), key=lambda x: x[1]))
▪ # keep only the tokens that occur more than once, dropping the rare ones
▪ final_tokens = [t for t in tokens if freq_dist[t] > 1]
Spell correction
▪ A few methods are available in the NLTK library to correct the spelling of incorrect words, based on string-distance measures such as:
▪ Jaccard distance (used in the example below)
▪ Edit distance
▪ Cosine similarity
▪ Euclidean distance
▪ Hamming distance
import nltk
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.corpus import words

nltk.download('words')
correct_words = words.words()

# list of incorrect spellings that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# find the correct spelling based on Jaccard distance over character bigrams
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2))), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])
Jaccard Distance
▪ Jaccard distance, the opposite of the Jaccard coefficient, is used to measure the dissimilarity between two sample sets.
▪ We get the Jaccard distance by subtracting the Jaccard coefficient from 1.
▪ We can also get it by dividing the difference between the sizes of the union and the intersection of two sets by the size of the union.
▪ We work with Q-grams (equivalent to N-grams), which are taken over characters instead of tokens.
▪ Jaccard Distance is given by the following formula
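Writing out the formula the slide refers to, with A and B as the two sets of Q-grams (or tokens), the description above corresponds to:
J(A, B) = |A ∩ B| / |A ∪ B|
d_J(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|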
Example
▪ Doc_1 = "educative is the best platform out there"
▪ Doc_2 = "educative is a new platform"
▪ Tokenizing the sentences:
▪ words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
▪ words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
▪ The intersection, or the common words between the documents: {'educative', 'is', 'platform'}; 3 words are shared.
▪ The union, or all the words in the documents: {'educative', 'is', 'the', 'best', 'platform', 'out', 'there', 'a', 'new'}; 9 words in total.
▪ Application: Netflix could represent customers as multisets of movies watched. It uses Jaccard distance to measure the similarity between two customers, i.e. how close their tastes are. Then, based on the preferences of two users and their similarity, we could potentially make recommendations to one or the other.
▪ Hence, the Jaccard similarity is 3/9 = 0.333, and the Jaccard distance is 1 - 0.333 = 0.667.
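A minimal sketch that reproduces this example with Python set operations (the numbers match the worked values above):

words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}

intersection = words_doc_1 & words_doc_2     # {'educative', 'is', 'platform'}
union = words_doc_1 | words_doc_2            # 9 distinct words

jaccard_similarity = len(intersection) / len(union)    # 3 / 9 = 0.333
jaccard_distance = 1 - jaccard_similarity               # 0.667
print(jaccard_similarity, jaccard_distance)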
2. Edit distance method
▪ Edit distance measures the dissimilarity between two strings by finding the minimum number of operations needed to transform one string into the other.
▪ The transformations that can be performed are: insertion of a character, deletion of a character, and substitution of a character (some variants also allow transposition of adjacent characters).
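A hedged sketch of spell correction with edit distance, mirroring the structure of the Jaccard snippet earlier (it assumes the same nltk words corpus has been downloaded):

from nltk.metrics.distance import edit_distance
from nltk.corpus import words   # requires nltk.download('words')

correct_words = words.words()
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

for word in incorrect_words:
    # rank candidates sharing the first letter by edit distance, pick the closest
    temp = [(edit_distance(word, w), w) for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])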
▪ To understand and generate text, NLP-powered systems must be able to recognize words,
grammar, and a whole lot of language nuances. For computers, this is easier said than done
because they can only comprehend numbers.
▪ To bridge the gap, NLP experts developed a technique called word embeddings that convert
words into their numerical representations. Once converted, NLP algorithms can easily
digest these learned representations to process textual information.
▪ Word embeddings map the words as real-valued numerical vectors. It does so by tokenizing
each word in a sequence (or sentence) and converting them into a vector space. Word
embeddings aim to capture the semantic meaning of words in a sequence of text. It assigns
similar numerical representations to words that have similar meanings.
Why ?
▪ Capturing semantic meaning: Word embeddings allow us to quantify and categorize semantic similarities between linguistic items. They provide a rich representation of words where the semantics are embedded in the dimensions of the vector space, making it possible for algorithms to understand the relationships between words.
▪ Dimensionality reduction: In contrast to traditional bag-of-words models, where each unique word in the corpus is assigned a unique dimension, word embeddings map words into a lower-dimensional space where the dimensions represent semantic features. This makes word embeddings more computationally efficient.
Types
One Hot Encoding
TF-IDF
Word2vec
GloVe
FastText
1. One hot encoding
▪ Sentence: I am teaching NLP in Python
▪ The dictionary is the list of all unique words present in the sentence, so it looks like:
▪ Dictionary: ['I', 'am', 'teaching', 'NLP', 'in', 'Python']
▪ Therefore, the one-hot vector representations according to the above dictionary are:
▪ Vector for NLP: [0,0,0,1,0,0]
▪ Vector for Python: [0,0,0,0,0,1]
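A minimal sketch that produces exactly these vectors (the dictionary is built directly from the example sentence):

sentence = "I am teaching NLP in Python"
dictionary = sentence.split()   # ['I', 'am', 'teaching', 'NLP', 'in', 'Python']

def one_hot(word, dictionary):
    # 1 at the word's position in the dictionary, 0 everywhere else
    return [1 if w == word else 0 for w in dictionary]

print(one_hot("NLP", dictionary))     # [0, 0, 0, 1, 0, 0]
print(one_hot("Python", dictionary))  # [0, 0, 0, 0, 0, 1]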
Disadvantages
▪ The Size of the vector is equal to the count of unique
words in the vocabulary.
▪ One-hot encoding does not capture the relationships
between different words. Therefore, it does not convey
information about the context
2. Bag-of-Words
▪ One of the popular word-embedding techniques for text, where each value in the vector represents the count of a word in a document/sentence.
▪ In other words, it extracts features from the text, which we also refer to as vectorization.
▪ 2 approaches:
▪ Tokenization
▪ Vectorization
Working of BOW
▪ First, the text is split into sentences; next, each sentence is further tokenized into words.
▪ The idea is to treat each document as a bag, or a collection, of
words, and then count the frequency of each word in the document.
▪ Review 1: This movie is very scary and long
▪ Review 2: This movie is not scary and is slow
▪ Review 3: This movie is spooky and good
▪ Vocabulary consists of 11 words
▪ ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
▪ We can now take each of these words and mark their occurrence in the three movie
reviews above with 1s and 0s. This will give us 3 vectors for 3 reviews
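A minimal sketch that builds these vectors with Python's Counter; the columns follow the 11-word vocabulary above (note that "is" is counted twice in Review 2, since bag-of-words records word frequencies):

from collections import Counter

vocab = ['This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good']
reviews = ["This movie is very scary and long",
           "This movie is not scary and is slow",
           "This movie is spooky and good"]

for review in reviews:
    counts = Counter(review.split())
    print([counts[w] for w in vocab])
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]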
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # document-term count matrix

print(X.toarray())
print(vectorizer.get_feature_names_out())
3. TF-IDF
▪ Terminology
▪ Term frequency (TF)
▪ Document frequency (DF)
▪ Inverse document frequency (IDF)
3.1 Terminology
▪ t — term (word).
▪ d — document (set of words).
▪ N — total number of documents in the corpus.
▪ corpus — the total document set.
▪ Term Frequency:
▪ The number of times a term occurs in a document is called its term frequency.
▪ The weight of a term that occurs in a document is simply proportional to the term frequency
▪ Document Frequency:
▪ The only difference is that TF is a frequency counter for a term t in document d, whereas DF is the count of documents in the document set N that contain the term t.
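Written compactly (the normalized form of TF here is one common convention, and it is the one used in the worked example later):
tf(t, d) = (number of times t appears in d) / (total number of terms in d)
df(t) = number of documents in the corpus that contain t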
▪ Inverse Document Frequency:
▪ While computing TF, all terms are considered equally important.
▪ However, certain terms, such as “is,” “of,” and “that,” may appear a lot of times but have little importance.
▪ We need to weigh down the frequent terms while scaling up the rare ones.
▪ When we compute IDF, an inverse document frequency factor is incorporated, which diminishes the weight of terms
that occur very frequently in the document set and increases the weight of terms that rarely occur.
▪ IDF is the inverse of the document frequency, which measures the informativeness of term t. When we calculate IDF, it
will be very low for the most occurring words, such as stop words like “is.” That’s because those words are present in
almost all of the documents, and N/df will give a very low value to words like that.
▪ idf(t) = N/df
▪ If you have a large corpus, say 100,000,000 documents, the IDF value explodes.
▪ To avoid this, we take the log of IDF. During query time, when a word that's not in the vocabulary occurs, the DF will be 0. Since we can't divide by 0, we smooth the value by adding 1 to the denominator:
▪ idf(t) = log(N / (df + 1))
TF-IDF Implementation
▪ TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus.
▪ Imagine the term t appears 20 times in a document that contains a total of 100 words. The term frequency (TF) of t is then: tf(t, d) = 20 / 100 = 0.2
▪ Assume a collection of related documents contains 10,000 documents. If 100 documents out of the 10,000 contain the term t, the inverse document frequency (IDF) of t is: idf(t) = log(10,000 / 100) = log(100) = 2
▪ Using these two quantities, the TF-IDF score of the term t for the document is: tf-idf(t, d) = tf × idf = 0.2 × 2 = 0.4
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

# vocabulary and an empty term-frequency table (one row per document)
words_set = sorted(set(w for doc in corpus for w in doc.split()))
df_tf = pd.DataFrame(0.0, index=range(len(corpus)), columns=words_set)

# computing TF
for i, doc in enumerate(corpus):
    words = doc.split()
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))

# computing IDF from the document frequency of each word across the corpus
idf = {w: np.log10(len(corpus) / sum(w in doc.split() for doc in corpus)) for w in words_set}
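For comparison, a hedged sketch of the same idea using scikit-learn's TfidfVectorizer; this is a library shortcut rather than the manual computation above, and sklearn applies its own smoothing and normalization, so the exact numbers differ slightly from the idf(t) = log(N/(df+1)) formula:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # TF-IDF weighted document-term matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))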
Word2Vec: CBOW (Continuous Bag-of-Words)
▪ CBOW is a technique where, given the neighboring words, the
center word is determined.
▪ If our input sentence is "I am reading the book.", then the input (context, label) pairs for a window size of 3 would be: ([I, reading], am), ([am, the], reading), ([reading, book], the), and so on.
▪ We start with the one-hot encodings of I and reading (shape 1x5), multiplying those encodings with an
encoding matrix of shape 5x3. The result is a 1x3 hidden layer.
▪ This hidden layer is now multiplied by a 3x5 decoding matrix to give us our prediction of a 1x5 shape. This
is compared to the actual label (am) one-hot encoding of the same shape to complete the architecture.
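A toy numpy sketch of that forward pass (vocabulary {I, am, reading, the, book}, embedding size 3); the matrices here are random placeholders standing in for the trained weights:

import numpy as np

vocab = ["I", "am", "reading", "the", "book"]
V, D = len(vocab), 3                      # vocabulary size 5, embedding size 3

np.random.seed(0)
W_enc = np.random.rand(V, D)              # 5x3 encoding matrix
W_dec = np.random.rand(D, V)              # 3x5 decoding matrix

def one_hot(word):
    v = np.zeros((1, V))
    v[0, vocab.index(word)] = 1.0
    return v                              # shape 1x5

# CBOW: combine the context encodings of "I" and "reading", then decode
context = (one_hot("I") + one_hot("reading")) / 2
hidden = context @ W_enc                  # 1x3 hidden layer
scores = hidden @ W_dec                   # 1x5 scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax prediction
print(probs)   # during training this is compared against one_hot("am")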
Skip-Gram Model
▪ Given the center word, we have to predict its
neighboring words. Quite literally the opposite of
CBOW, but more efficient.
▪ Let our given input sentence be "I am reading the book." The corresponding (center, context) Skip-Gram pairs for a window size of 3 would be: (am, I), (am, reading), (reading, am), (reading, the), (the, reading), (the, book), and so on.
▪ Vocabulary size= 5, and we will assume there are 3 embedding dimensions for simplicity.
▪ Starting with the encoding matrix, we grab the vector located at the index of our center word (am in this case). Transposing it,
we now have a 3x1 vector representation of the word am(since we are directly grabbing a row of the encoding matrix,
this WILL NOT be a one-hot encoding).
▪ Multiply this vector representation with the decoding matrix of shape 5x3, giving us the predicted output of shape 5x1. Now,
this vector will essentially be a SoftMax representation over the whole vocabulary, pointing to the indices belonging to the
neighboring words of our input center word. In this case, the output should point to the indices of I and reading
Training Word2Vec
1. Initialization of vectors
1. Vectors are initialized in a high-dimensional space (up to 1000-D).
2. Random initialization breaks symmetry and ensure that model learns something useful as it starts training.
3. During training, based on objective function, vectors of similar contextual words are positioned nearer.
2. Optimization techniques and Backpropagation
1. To capture linguistic context of words
2. To iteratively adjust the word vectors so that the model’s predictions align more closely with the actual context words.
3. Backpropagation is a method used in neural networks to calculate the gradient of the loss function with respect to the weights of the
network. In the context of Word2Vec, backpropagation adjusts the word vectors based on the errors in predicting context words. Through
successive iterations, the model becomes increasingly accurate in its predictions, leading to optimized word vectors.
3. Window size
1. Words within the window are considered as context words, while those outside are ignored.
2. A smaller window size results in learning more about the word’s syntactic roles, while a larger window size helps the model understand
the broader semantic context.
4. Negative Sampling and Subsampling of frequent words
1. Negative sampling addresses the issue of computational efficiency by updating only a small percentage of the model’s weights at each
step rather than all of them. This is done by sampling a small number of “negative” words (words not in the context) to update for each
target word.
2. Subsampling of frequent words helps in improving the quality of word vectors. The basic idea is to reduce the impact of high-frequency
words in the training process as they often carry less meaningful information compared to rare words.
3. By randomly discarding some instances of frequent words, the model is forced to focus more on the rare words, leading to more balanced and meaningful word vectors (see the gensim sketch below for where these hyperparameters appear).
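A hedged gensim sketch showing where these training choices surface as hyperparameters (vector size, window, CBOW vs. skip-gram, negative sampling, subsampling); the two-sentence corpus is a made-up toy example:

from gensim.models import Word2Vec

sentences = [["i", "am", "reading", "the", "book"],
             ["i", "am", "teaching", "nlp", "in", "python"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=3,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling: 5 noise words per positive example
    sample=1e-3,       # subsampling threshold for frequent words
    min_count=1,
    epochs=50,
)

print(model.wv["nlp"][:5])               # first 5 dimensions of a learned vector
print(model.wv.most_similar("reading"))  # nearest neighbours in the embedding space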
Things to remember
GloVe Embeddings (Global Vectors)
▪ It is an unsupervised learning algorithm developed by researchers at Stanford University that generates word embeddings by aggregating global word-word co-occurrence statistics from a given corpus.
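Pretrained GloVe vectors can be loaded through gensim's downloader; a hedged sketch (the "glove-wiki-gigaword-100" model name is assumed to be available in gensim-data, and the first call downloads the vectors):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # 100-d GloVe vectors (Wikipedia + Gigaword)

print(glove["king"][:5])                      # first 5 dimensions of the vector
print(glove.most_similar("king", topn=3))     # semantically close words
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))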
FastText
▪ word2vec and GloVe provide distinct vector representations for the words in the vocabulary.
▪ FastText provides embeddings for character n-grams, representing words as the average of these
embeddings .
▪ Word2Vec model provides embedding to the words, whereas fastText provides embeddings to the
character n-grams. Like the word2vec model, fastText uses CBOW and Skip-gram to compute the
vectors.
▪ FastText can also handle out-of-vocabulary words, i.e., fastText can produce embeddings for words that were not seen during training.
▪ Out-of-vocabulary (OOV) words are words that do not occur while training the data and are not
present in the model’s vocabulary.
▪ In FastText, each word is represented as the average of the vector representation of
its character n-grams along with the word itself.
▪ Consider the word “equal” and n = 3, then the word will be represented by character
n-grams:
▪ < eq, equ, qua, ual, al > and < equal >
▪ The word embedding for the word 'equal' is then given as the sum of the vector representations of all of its character n-grams and the word itself.
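A hedged gensim sketch of training fastText on a toy corpus and querying an out-of-vocabulary word; min_n and max_n control the character n-gram lengths:

from gensim.models import FastText

sentences = [["all", "men", "are", "equal"],
             ["equality", "is", "a", "right"]]

model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=50)

print(model.wv["equal"][:5])     # built from the word plus its character n-grams
print(model.wv["equals"][:5])    # OOV word: still gets a vector from shared n-grams
print("equals" in model.wv.key_to_index)   # False: not in the trained vocabulary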
FastText - CBOW
Word2vec vs Fasttext
▪ Word2Vec works on the word level, while fastText works on the character n-grams.
▪ FastText uses a hierarchical softmax classifier to train the model; hence it is faster than word2vec.
Thank you
Dr D Paul Joseph,
Asst Prof, Sr Gr-I,
Department of Network and Security,
School of Computer Science and Engineering,
VIT-Amaravathi
[email protected]