NLP Module 1

The document provides an overview of Natural Language Processing (NLP), discussing its origins, challenges, and various preprocessing techniques such as tokenization, stemming, and lemmatization. It highlights the history of NLP, key challenges like ambiguity and domain-specific language, and applications including sentiment analysis and chatbots. Additionally, it introduces tools like NLTK for implementing NLP tasks and emphasizes the importance of data preparation for effective NLP modeling.

Natural Language Processing – CSE3015

Dr. D Paul Joseph,
Assistant Professor, Sr Grade-I,
Dept of Networking and Security,
SCOPE, VIT-AP
[email protected]
Agenda
▪ Overview

▪ Origins and challenges of NLP-Need of NLP

▪ Preprocessing techniques

▪ Text Wrangling, Text cleansing, sentence splitter

▪ Tokenization, stemming, lemmatization, stop word removal, rare word removal, spell correction.

▪ Word Embeddings, Different Types

▪ One Hot Encoding, Bag of Words (BoW), TF-IDF

▪ Static word embeddings: Word2vec, GloVe, FastText

Basics of NLP

▪ NLP – Natural Language Processing


▪ It is a field of computer science and artificial intelligence (AI).
▪ Objective: To enable computers to understand, analyze, and generate human language in a way that is similar to how humans do.
History of NLP

▪ ELIZA was an early natural language processing system capable of carrying on a limited form of conversation with a user.
History of NLP

▪ Mid 1950's – Mid 1960's: Birth of NLP and Linguistics. At first, people thought NLP is easy! Researchers predicted that "machine translation" can be solved in 3 years or so. Mostly hand-coded rules / linguistic-oriented approaches. The 3-year project continued for 10 years, but still no good result, despite the significant amount of expenditure.
▪ Mid 1960's – Mid 1970's: A Dark Era. After the initial hype, a dark era follows. People started believing that machine translation is impossible, and most abandoned research for NLP.
▪ 1970's and early 1980's: Slow Revival of NLP. Some research activities revived, but the emphasis is still linguistically oriented, working on small toy problems with weak empirical evaluation.
▪ Late 1980's and 1990's: Statistical Revolution. The computing power increased substantially. Data-driven statistical approaches with simple representations win over complex hand-coded linguistic rules.
▪ 2000's: Statistics powered by Linguistic Insights. With more sophistication in the statistical models, richer linguistic representations start finding new value.
▪ 2010's: Emergence of embedding models and deep neural networks. Several models: Word2Vec, GloVe, fastText, ELMo, BERT, ColBERT, GPT [1–3.5]. New techniques brought attention to more complex tasks.
Challenges in NLP

Challenges contd..
▪ Ambiguity - sentences and phrases that potentially have two or more possible interpretations.
▪ Lexical
▪ The ambiguity of a single word is called lexical ambiguity. A word that could be used as a verb, noun, or adjective
▪ Ex: bat(Noun or object)? I made it( Made→ created or cooked)
▪ Can be solved by Part-of-Speech tagging and word-sense disambiguation
▪ Semantic
▪ This kind of ambiguity occurs when the meaning of the words themselves can be misinterpreted
▪ "The car hit the pole while it was moving."
▪ It -> Car or pole? Ambiguity in entities.
▪ Can be solved by Probabilistic parsing

▪ Syntactic
▪ when a sentence is parsed in different ways
▪ “The man saw the girl with the telescope”

▪ Anaphoric
▪ the use of anaphora entities in discourse
▪ “the horse ran up the hill. It was very steep. It soon got tired”

▪ Pragmatic
▪ knowledge of the relationship of meaning to the goals and intentions of the speaker
▪ situation where the context of a phrase gives it multiple interpretations
▪ arises when the statement is not specific. Ex: "I like you too"
Challenges - Ambiguity
▪ Include your children when baking cookies
▪ Local High School Dropouts Cut in Half
▪ Hospitals are Sued by 7 Foot Doctors
▪ Iraqi Head Seeks Arms
▪ Safety Experts Say School Bus Passengers Should Be Belted
▪ Teacher Strikes Idle Kids
Challenges - Ambiguity: Pronoun Reference Ambiguity
Challenges contd..
▪ Errors in text or speech
▪ Misspelled or misused words can create problems for text analysis.

▪ Colloquialisms and slang


▪ use informal words and expressions.
▪ Informal phrases, expressions, idioms, and culture-specific lingo present a number of problems for NLP.
▪ And cultural slang is constantly morphing and expanding, so new words pop up every day
▪ Ex: "Rain check" means "postpone the plan".
▪ Slang: "Old fogey" means "old person".

▪ Domain-specific language
▪ Different businesses and industries often use very different language.

▪ Low-resource languages
▪ Many languages, especially those spoken by people with less access to technology, often go overlooked and underprocessed.

▪ Lack of research and development


▪ The more data NLP models are trained on, the smarter they become.
Applications of NLP

▪ Sentiment Analysis
▪ Question Answering
▪ Spam Detection
▪ Google Home, Alexa
▪ Spelling correction
▪ Chatbot
▪ Machine Translation
Machine Translation

Dialog Systems

Sentiment or Twitter analysis

Text Classification

Question & Answer

Digital Personal Assistant

Information Extraction – Unstructured text to database entries

Language Comprehension
Introduction to NLTK

Toolkit required: NLTK
Programming Language: Python
Installing: pip install nltk

A variety of tasks can be performed using NLTK:
▪ Tokenization
▪ Lower case conversion
▪ Stop Words removal
▪ Stemming
▪ Lemmatization
▪ Parse tree or Syntax Tree generation
▪ POS Tagging

Packages:
▪ nltk.classify
▪ nltk.cluster
▪ nltk.corpus
▪ nltk.metrics
▪ nltk.parse
▪ nltk.stem
▪ nltk.tokenize
▪ nltk.twitter
Text wrangling
Text wrangling is basically the pre-processing work that is done to prepare raw text data for training.

Simply put, it is the process of cleaning your data to make it readable by your program, and then formatting it accordingly.

Also known as data preparation.

It includes:

• Tokenization
• Stop word removal
• Stemming
• Lemmatization
• Rare word removal
• Spell correction
Tokenization
▪ Breaking the raw text into small chunks, called tokens.
▪ These tokens help in understanding the context or developing the model for the
NLP
▪ 2 Types:
▪ Sentence –level
▪ from nltk.tokenize import sent_tokenize
▪ Word – level
▪ from nltk.tokenize import word_tokenize

Sentence-level example:
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))
Output: ['God is Great!', 'I won a lottery.']

Word-level example:
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))
Output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
Stop word removal
▪ The words which are generally filtered out before processing a natural language are called stop words.
▪ These are actually the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
▪ By removing commonly used words that do not contribute much to the context, search systems are able to process data more quickly and accurately. Removing stop words helps eliminate low-information words from the text, allowing NLP algorithms to focus on the words that are more significant and provide context.
▪ Examples of a few stop words in English are "the", "a", "an", "so", "what".

Common stop words:
▪ The most frequently occurring words in a language; they are often removed during text preprocessing.
▪ Examples: "the," "is," "in," "for," "where," "when," "to," "at."

Custom/Domain-specific stop words:
▪ Depending on the specific task or domain, additional words may be considered as stopwords.
▪ In a medical context, words like "dr.", "patient" or "treatment" might be considered.

Numerical stopwords:
▪ Numbers and numeric characters may be treated as stopwords in certain cases, especially when the analysis is focused on the meaning of the text rather than specific numerical values.

Single-character stopwords:
▪ Single characters, such as "a," "I," "s," or "x," may be considered stopwords, particularly in cases where they don't convey much meaning on their own.

Implementing using NLTK
Libraries that support stop word removal:
1. NLTK
2. spaCy (spacy.load("en_core_web_sm"))
3. Gensim (from gensim.parsing.preprocessing import ...)
4. scikit-learn (from sklearn.feature_extraction.text import CountVectorizer)

Using NLTK:
▪ import nltk
▪ nltk.download('stopwords')
▪ from nltk.corpus import stopwords
▪ from nltk.tokenize import word_tokenize
▪ text = "Nick likes to play football, however he is not too fond of tennis."
▪ text_tokens = word_tokenize(text)
▪ tokens_without_sw = [word for word in text_tokens if not word in stopwords.words()]
▪ print(tokens_without_sw)

Output: ['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']
Stemming
▪ It is the process of reducing the inflected forms of a word to one so-called "stem", or root form, also known as a "lemma" in linguistics.

▪ Stemming essentially strips affixes from words, leaving only the base form.

▪ Tokenization is essential before stemming

▪ Issues:

▪ Over Stemming (two semantically distinct words are reduced to the same root, and so conflated) Ex: Wander - >Wand

▪ Under Stemming (when two words semantically related are not reduced to the same root) Ex:- Knavish -> Knavish and Knave -
>Knave (Dishonest)

▪ Types:

▪ Lovins Stemmer

▪ Porter Stemmer (running -> run)

▪ Snowball Stemmer (running -> run)

▪ Lancaster Stemmer (running -> run)

▪ Regexp Stemmer (running -> runn)

Implementation
▪ from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
▪ porter = PorterStemmer()
▪ lancaster = LancasterStemmer()
▪ snowball = SnowballStemmer(language='english')
▪ regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
▪ word_list = ["friend", "friendship", "friends", "friendships"]
▪ print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp
Stemmer'))
▪ for word in word_list:
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),reg
exp.stem(word)))

Word            Porter          Snowball        Lancaster       Regexp
friend          friend          friend          friend          friend
friendship      friendship      friendship      friend          friendship
friends         friend          friend          friend          friend
friendships     friendship      friendship      friend          friendship
Lemmatization
▪ Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
▪ Lemmatization is similar to stemming but it brings context to the
words. So it links words with similar meanings to one word.
▪ It is another way to extract the base form of words, normally aiming
to remove inflectional endings by using vocabulary and
morphological analysis.
▪ After lemmatization, the base form of any word is called lemma.

Problems in stemming

▪ Lemmatization takes a word and breaks it down to its lemma.
▪ For example, the verb "walk" might appear as "walking," "walks" or "walked." Inflectional
endings such as "s," "ed" and "ing" are removed. Lemmatization groups these words as its
lemma, "walk.“
▪ word "saw" might be interpreted differently, depending on the sentence.
▪ For example, "saw" can be broken down into the lemma "see" or "saw."
▪ In these cases, lemmatization attempts to select the right lemma depending on the context of
the word, surrounding words and sentence.
▪ Other words, such as "better" might be broken down to a lemma such as "good."
▪ Search engine algorithms use lemmatization, so the user can query any inflectional form of a word and get relevant results. For example, if the user queries the plural form of a word such as "routers", the search engine knows to also return relevant content that uses the singular form of the same word -- "router".
▪ Stemming operates without any contextual knowledge, meaning that it can't discern between similar words with different meanings.
▪ Stemming is less complex than lemmatization and faster than lemmatization.
▪ Application areas: Artificial intelligence (AI), Big data analytics, Chatbots, Machine learning (ML), NLP, Search queries, Sentiment analysis.
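A minimal NLTK sketch of the above (assuming the WordNet data has been downloaded; the word choices follow the examples on this slide):

import nltk
nltk.download('wordnet')          # WordNet data is needed by the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("walking", pos="v"))   # walk
print(lemmatizer.lemmatize("walks", pos="v"))     # walk
print(lemmatizer.lemmatize("better", pos="a"))    # good
print(lemmatizer.lemmatize("saw", pos="v"))       # see (verb reading)
print(lemmatizer.lemmatize("saw", pos="n"))       # saw (noun reading)

Note that the pos argument supplies the context that lemmatization needs; stemming has no equivalent.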
Rare word removal
▪ Sometimes we need to remove the words that are very unique in nature, like names, brands, product names, and some of the noise characters, such as HTML leftovers.
▪ This is considered as "rare word removal".
▪ Those words appear less frequently in the text.
▪ Method: FreqDist() is used to get the distribution of the terms in the corpus; the least distributed words can be taken as rare words.

▪ from nltk import FreqDist
▪ tokens = ['hi','i','am','am','whatever','this','is','just','a','test','test','java','python','java']
▪ freq_dist = FreqDist(tokens)
▪ sorted_tokens = dict(sorted(freq_dist.items(), key=lambda x: x[1]))
▪ final_tokens = []
▪ for x, y in sorted_tokens.items():
      if y > 1:
          final_tokens.append(x)
▪ print(final_tokens)

▪ Output: ['am', 'test', 'java']
Spell correction
▪ Spelling correction is an important phase of the text cleaning process, since misspelled words can lead to wrong predictions during the machine learning process.

▪ A few methods are available in the NLTK library to correct the spelling of incorrect words:

▪ Jaccard distance method

▪ Edit distance method

▪ Cosine Similarity

▪ Euclidean Distance

▪ Hamming Distance

▪ Levenshtein Edit Distance

▪ Longest Common Substring


1. Jaccard distance

import nltk
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()
# list of incorrect spellings
# that need to be corrected
incorrect_words=['happpy', 'azmaing', 'intelliengt']
# loop for finding correct spellings based on jaccard distance and printing the correct word
for word in incorrect_words:
    temp = [(jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2))), w) for w in correct_words if w[0] == word[0]]
    print(sorted(temp, key=lambda val: val[0])[0][1])
Jaccard Distance
▪ The opposite of the Jaccard coefficient, it is used to measure the dissimilarity between two sample sets.
▪ We get the Jaccard distance by subtracting the Jaccard coefficient from 1.
▪ We can also get it by dividing the difference between the sizes of the union and the intersection of two sets by the size of the union:
▪ Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|; Jaccard distance: d(A, B) = 1 − J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
▪ We work with Q-grams (these are equivalent to N-grams), which refer to characters instead of tokens.
▪ The Jaccard similarity has a range of 0 to 1. If the two documents are identical, the Jaccard similarity is 1; it is 0 if there are no common words between the two documents.
Example
▪ Doc_1 = "educative is the best platform out there"
▪ Doc_2 = "educative is a new platform"
▪ Tokenizing the sentences:
▪ words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
▪ words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
▪ The intersection, or the common words between the documents: {'educative', 'is', 'platform'}. 3 words are in common.
▪ The union, or all the words in the documents: {'educative', 'is', 'the', 'best', 'platform', 'out', 'there', 'a', 'new'}. In total, there are 9 words.
▪ Hence, the Jaccard similarity is 3/9 = 0.333.

▪ Application: Netflix could represent customers as multisets of movies watched. It uses Jaccard distance to measure the similarity between two customers, i.e. how close their tastes are. Then, based on the preferences of two users and their similarity, we could potentially make recommendations to one or the other.
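A minimal sketch reproducing the calculation above with plain Python sets (the variable names are illustrative):

doc_1 = "educative is the best platform out there"
doc_2 = "educative is a new platform"
words_doc_1 = set(doc_1.split())
words_doc_2 = set(doc_2.split())

intersection = words_doc_1 & words_doc_2             # {'educative', 'is', 'platform'}
union = words_doc_1 | words_doc_2                    # 9 unique words
jaccard_similarity = len(intersection) / len(union)  # 3/9 = 0.333...
jaccard_distance = 1 - jaccard_similarity            # 0.666...
print(jaccard_similarity, jaccard_distance)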
2. Edit distance method
▪ Edit Distance measures dissimilarity between two strings by finding the minimum number of
operations needed to transform one string into the other.
▪ The transformations that can be performed are:

• Inserting a new character:


• bat -> bats (insertion of 's')
• Deleting an existing character.
• care -> car (deletion of 'e')
• Substituting an existing character.
• bin -> bit (substitution of n with t)
• Transposition of two existing consecutive characters.
• sing -> sign (transposition of ng to gn)
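A minimal sketch of the same operations using NLTK's edit_distance (the word pairs are the ones listed above):

from nltk.metrics.distance import edit_distance

print(edit_distance("bat", "bats"))   # 1 - insertion of 's'
print(edit_distance("care", "car"))   # 1 - deletion of 'e'
print(edit_distance("bin", "bit"))    # 1 - substitution of 'n' with 't'
# a transposition counts as a single operation only when transpositions=True
print(edit_distance("sing", "sign", transpositions=True))   # 1

For spell correction, the Jaccard loop shown earlier can be reused with edit_distance in place of jaccard_distance, picking the candidate word with the smallest distance.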
3. Cosine Similarity
▪ The cosine similarity measures the proximity between two non-zero vectors.
▪ The cosine similarity of two text units simply computes the cosine of the angle formed by the two vectors representing the text units, i.e. the inner product in Euclidean space of the normalized vectors: cos(A, B) = (A · B) / (||A|| ||B||).
▪ When close to 1, the two units are close in the chosen vector space; when close to -1, the two units are far apart.
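A minimal scikit-learn sketch (CountVectorizer is also used later in these slides; the toy documents are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["I love NLP", "I love Python", "the weather is cold"]
X = CountVectorizer().fit_transform(docs)   # bag-of-words count vectors
print(cosine_similarity(X))                 # pairwise similarities; the diagonal is 1.0

The first two documents share words, so their off-diagonal similarity is high; the third shares none, so its similarity to the others is 0.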
Word Embeddings
▪ To understand and generate text, NLP-powered systems must be able to recognize words, grammar, and a whole lot of language nuances. For computers, this is easier said than done because they can only comprehend numbers.

▪ To bridge the gap, NLP experts developed a technique called word embeddings that convert
words into their numerical representations. Once converted, NLP algorithms can easily
digest these learned representations to process textual information.

▪ Word embeddings map the words as real-valued numerical vectors. It does so by tokenizing
each word in a sequence (or sentence) and converting them into a vector space. Word
embeddings aim to capture the semantic meaning of words in a sequence of text. It assigns
similar numerical representations to words that have similar meanings.

Why?

▪ Capturing semantic meaning: Word embeddings allow us to quantify and categorize semantic similarities between linguistic items. They provide a rich representation of words where the semantics are embedded in the dimensions of the vector space, making it possible for algorithms to understand the relationships between words.
▪ Dimensionality reduction: In contrast to traditional bag-of-words models, where each unique word in the corpus is assigned a unique dimension, word embeddings map words into a lower-dimensional space where the dimensions represent semantic features. This makes word embeddings more computationally efficient.
▪ Handling large vocabularies: Traditional text representation techniques struggle in the face of vast vocabularies, due to the curse of dimensionality and sparsity issues. By representing words as dense vectors, word embeddings can handle large vocabularies efficiently.
▪ Enabling transfer learning: This is a machine learning technique where pre-trained models are used on a new, but related problem. Pre-trained word embeddings learned from large datasets can be leveraged to improve performance on smaller, related tasks. This can significantly reduce the effort of creating new NLP models.
Types
One Hot Encoding

Bag of Words (BoW)

TF-IDF

Word2vec

GloVe

FastText
1. One hot encoding

▪ One Hot encoding is a representation of categorical variables as binary vectors.
▪ Each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.
▪ Sentence: I am teaching NLP in Python

▪ A word in this sentence may be “NLP”, “Python”, “teaching”, etc.

▪ Since a dictionary is defined as the list of all unique words present in the
sentence. So, a dictionary may look like –
▪ Dictionary: [‘I’, ’am’, ’teaching’,’ NLP’,’ in’, ’Python’]
▪ Therefore, the vector representation in this format according to the above
dictionary is
▪ Vector for NLP: [0,0,0,1,0,0]
▪ Vector for Python: [0,0,0,0,0,1]
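A minimal pure-Python sketch of the example above (the helper function one_hot is illustrative, not a library call):

sentence = "I am teaching NLP in Python"
dictionary = sentence.split()            # ['I', 'am', 'teaching', 'NLP', 'in', 'Python']

def one_hot(word, vocab):
    # all zeros except a 1 at the word's index in the vocabulary
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("NLP", dictionary))        # [0, 0, 0, 1, 0, 0]
print(one_hot("Python", dictionary))     # [0, 0, 0, 0, 0, 1]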
Disadvantages
▪ The Size of the vector is equal to the count of unique
words in the vocabulary.
▪ One-hot encoding does not capture the relationships
between different words. Therefore, it does not convey
information about the context

2. Bag-of-Words
▪ One of the popular word embedding techniques of text where each value
in the vector would represent the count of words in a
document/sentence.

▪ In other words, it extracts features from the text., which we also refer to
it as vectorization.

▪ Bag of Words takes a document from a corpus and converts it into a


numeric vector by mapping each document word to a feature vector for
the machine learning model.

▪ 2 approaches:
▪ Tokenization
▪ Vectorization
Working of BOW

▪ In the first step, tokenize the text into sentences.

▪ Next, the sentences tokenized in the first step have further tokenized
words.

▪ Eliminate any stop words or punctuation.

▪ Then, convert all the words to lowercase.

▪ Finally, move to create a frequency distribution chart of the words.

▪ The idea is to treat each document as a bag, or a collection, of
words, and then count the frequency of each word in the document.

▪ It does not consider the order of words but provides a


straightforward way to convert text into vectors.

▪ Pros: Simple, capture word importance based on frequency

▪ Cons: Ignores word order and context.

▪ Review 1: This movie is very scary and long
▪ Review 2: This movie is not scary and is slow
▪ Review 3: This movie is spooky and good
▪ Vocabulary consists of 11 words
▪ ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
▪ We can now take each of these words and mark their occurrence in the three movie
reviews above with 1s and 0s. This will give us 3 vectors for 3 reviews

from sklearn.feature_extraction.text import CountVectorizer
# Sample data
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',]

# Initialize the CountVectorizer


vectorizer = CountVectorizer()

# Fit and transform the corpus


X = vectorizer.fit_transform(corpus)

print(X.toarray())
print(vectorizer.get_feature_names_out())

#output of the above code


[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
3. Term Frequency &
Inverse Document
Frequency
3. TF-IDF

▪ Terminology

▪ Term frequency(TF)

▪ Document frequency (DF)

▪ Inverse document frequency (IDF)

3.1 Terminology
▪ t — term (word).
▪ d — document (set of words).
▪ N — total number of documents in the corpus.
▪ corpus — the total document set.

▪ Term Frequency:

▪ The number of times a term occurs in a document is called its term frequency.

▪ The weight of a term that occurs in a document is simply proportional to the term frequency

▪ Document Frequency:

▪ It measures the importance of a document in a whole set of corpus.

▪ The only difference is that TF is a frequency counter for a term t in document d, whereas DF is the count
of occurrences of term t in the document set N.

▪ DF is the number of documents in which the word is present.

▪ df(t) = occurrence of t in documents

▪ Inverse Document Frequency:
▪ While computing TF, all terms are considered equally important.

▪ However, certain terms, such as “is,” “of,” and “that,” may appear a lot of times but have little importance.

▪ We need to weigh down the frequent terms while scaling up the rare ones.

▪ When we compute IDF, an inverse document frequency factor is incorporated, which diminishes the weight of terms
that occur very frequently in the document set and increases the weight of terms that rarely occur.

▪ IDF is the inverse of the document frequency, which measures the informativeness of term t. When we calculate IDF, it
will be very low for the most occurring words, such as stop words like “is.” That’s because those words are present in
almost all of the documents, and N/df will give a very low value to words like that.

▪ idf(t) = N/df

▪ If you have a large corpus, say 100,000,000 documents, the IDF value explodes.

▪ To avoid this, we take the log of IDF. During query time, when a word that's not in the vocabulary occurs, the DF will be 0. Since we can't divide by 0, we smooth the value by adding 1 to the denominator: idf(t) = log(N / (df + 1)).
TF-IDF Implementation
▪ TF-IDF is a measure used to evaluate how important a word is to a document in a collection or corpus.

▪ Imagine the term t appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of t can be calculated as TF = 20/100.

▪ Assume a collection of related documents contains 10,000 documents. If 100 documents out of the 10,000 contain the term t, the Inverse Document Frequency (IDF) of t can be calculated as IDF = log(10,000/100).

▪ Using these two quantities, we can calculate the TF-IDF score of the term t for the document as TF × IDF.
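A minimal sketch of this worked example, assuming a base-10 logarithm (the same base used in the implementation on the next slide):

import numpy as np

tf = 20 / 100                    # term t appears 20 times in a 100-word document -> 0.2
idf = np.log10(10000 / 100)      # 100 of the 10,000 documents contain t -> 2.0
tf_idf = tf * idf                # 0.2 * 2.0 = 0.4
print(tf, idf, tf_idf)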
import pandas as pd
import numpy as np

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'data scientists analyze data']

# creating a word set
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

# computing Term Frequency (TF)
n_docs = len(corpus)          # Number of documents in the corpus
n_words_set = len(words_set)  # Number of unique words in the corpus
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=words_set)
for i in range(n_docs):
    words = corpus[i].split(' ')  # Words in the document
    for w in words:
        df_tf[w][i] = df_tf[w][i] + (1 / len(words))
df_tf

# computing IDF
print("IDF of: ")
idf = {}
for w in words_set:
    k = 0  # number of documents in the corpus that contain this word
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
    idf[w] = np.log10(n_docs / k)
    print(f'{w:>15}: {idf[w]:>10}')

# computing TF-IDF
df_tf_idf = df_tf.copy()
for w in words_set:
    for i in range(n_docs):
        df_tf_idf[w][i] = df_tf[w][i] * idf[w]
df_tf_idf
4. Word2Vec
▪ Words as vectors
▪ expressing each word in text corpus in an N-dimensional space (embedding space)
▪ The word’s weight in each dimension of that embedding space defines it for the
model.
▪ How are the weights assigned?

▪ We help define the meaning of words based on their context.


▪ The context of a word is defined by its neighboring words. Hence, the meaning of a word depends on the words with which it is associated.
Types
C-BOW Skip gram

C-BOW(Continuous Bag-of-Words (CBOW)
▪ CBOW is a technique where, given the neighboring words, the
center word is determined.
▪ If our input sentence is “I am reading the book.”, then the
input pairs and labels for a window size of 3 would be:

▪ The CBOW model predicts the target word from its surrounding context words.
▪ Uses the surrounding words to predict the word in the middle.
▪ Takes all the context words, aggregates them, and uses the resultant vector to predict the target word.
▪ input-label pair of (I, reading) – (am).

▪ We start with the one-hot encodings of I and reading (shape 1x5), multiplying those encodings with an
encoding matrix of shape 5x3. The result is a 1x3 hidden layer.

▪ This hidden layer is now multiplied by a 3x5 decoding matrix to give us our prediction of a 1x5 shape. This
is compared to the actual label (am) one-hot encoding of the same shape to complete the architecture.
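A minimal NumPy sketch of the shapes described above (vocabulary size 5, 3 embedding dimensions; the weights are random placeholders rather than trained values):

import numpy as np
np.random.seed(0)

vocab = ["I", "am", "reading", "the", "book"]
V, D = len(vocab), 3
W_enc = np.random.rand(V, D)     # 5x3 encoding matrix
W_dec = np.random.rand(D, V)     # 3x5 decoding matrix

def one_hot(word):
    v = np.zeros((1, V))
    v[0, vocab.index(word)] = 1
    return v

# context (I, reading) -> aggregate the two 1x3 hidden vectors
hidden = (one_hot("I") @ W_enc + one_hot("reading") @ W_enc) / 2   # shape 1x3
scores = hidden @ W_dec                                            # shape 1x5
probs = np.exp(scores) / np.exp(scores).sum()                      # softmax over the vocabulary
print(vocab[int(probs.argmax())])   # arbitrary before training; training pushes it toward "am"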

Skip-Gram Model
▪ Given the center word, we have to predict its
neighboring words. Quite literally the opposite of
CBOW, but more efficient.
▪ Let our given input sentence be “I am reading the
book.” The corresponding Skip-Gram pairs for a
window size of 3 would be:

▪ The Skip-Gram model predicts the surrounding context words from a target word.
▪ In other words, it uses a single word to predict its surrounding context.
▪ Vocabulary size= 5, and we will assume there are 3 embedding dimensions for simplicity.

▪ Starting with the encoding matrix, we grab the vector located at the index of our center word (am in this case). Transposing it,
we now have a 3x1 vector representation of the word am(since we are directly grabbing a row of the encoding matrix,
this WILL NOT be a one-hot encoding).
▪ Multiply this vector representation with the decoding matrix of shape 5x3, giving us the predicted output of shape 5x1. Now,
this vector will essentially be a SoftMax representation over the whole vocabulary, pointing to the indices belonging to the
neighboring words of our input center word. In this case, the output should point to the indices of I and reading.
Training Word2Vec
1. Initialization of vectors
1. Initially high-dimensional, up to 1000-D
2. Random initialization breaks symmetry and ensure that model learns something useful as it starts training.
3. During training, based on objective function, vectors of similar contextual words are positioned nearer.
2. Optimization techniques and Backpropagation
1. To capture linguistic context of words
2. To iteratively adjust the word vectors so that the model’s predictions align more closely with the actual context words.
3. Backpropagation is a method used in neural networks to calculate the gradient of the loss function with respect to the weights of the
network. In the context of Word2Vec, backpropagation adjusts the word vectors based on the errors in predicting context words. Through
successive iterations, the model becomes increasingly accurate in its predictions, leading to optimized word vectors.
3. Window size
1. Words within the window are considered as context words, while those outside are ignored.
2. A smaller window size results in learning more about the word’s syntactic roles, while a larger window size helps the model understand
the broader semantic context.
4. Negative Sampling and Subsampling of frequent words
1. Negative sampling addresses the issue of computational efficiency by updating only a small percentage of the model’s weights at each
step rather than all of them. This is done by sampling a small number of “negative” words (words not in the context) to update for each
target word.
2. Subsampling of frequent words helps in improving the quality of word vectors. The basic idea is to reduce the impact of high-frequency
words in the training process as they often carry less meaningful information compared to rare words.
3. By randomly discarding some instances of frequent words, the model is forced to focus more on the rare words, leading to more balanced
and meaningful word vectors.
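A minimal, hedged gensim sketch tying the training choices above together (CBOW vs Skip-gram, window size, negative sampling, subsampling); the toy corpus and parameter values are illustrative assumptions:

from gensim.models import Word2Vec

sentences = [["i", "love", "nlp"],
             ["i", "love", "python"],
             ["word", "embeddings", "capture", "meaning"]]

# sg=0 -> CBOW, sg=1 -> Skip-gram; window sets the context size,
# negative enables negative sampling, sample controls subsampling of frequent words
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                 sg=1, negative=5, sample=1e-3, epochs=50, seed=42)

print(model.wv["nlp"].shape)          # (50,) dense vector for "nlp"
print(model.wv.most_similar("nlp"))   # nearest words in the embedding space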
Things to remember

▪ Window size: Larger the size, higher the complexity


▪ Vectors: assigned random values to the words, then used Back propagation
and SoftMax function.
▪ Aim: to perform dimensionality reduction, and create dense word vectors.
▪ CBOW: More beneficial than Skip-gram for small datasets; faster implementation.
▪ Skip-gram: Works on larger datasets. Improved ability to capture semantic relationships, handles rare words, and is flexible to linguistic context. Computationally more expensive, as it predicts multiple context words.

5. GloVe Embeddings (Global Vectors)
▪ It is an unsupervised learning algorithm developed by researchers at Stanford
University aiming to generate word embeddings by aggregating global word co-
occurrence matrices from a given corpus.

▪ Derives semantic relationships between words using word-word co-occurrence


matrix
▪ Creates a matrix on the count, where two words come together.
▪ The key idea behind GloVe is to learn word embeddings by examining the
probability of word co-occurrences across the entire corpus.
▪ basic idea behind the GloVe word embedding is to derive the relationship between
the words from statistics.
▪ The co-occurrence matrix tells you how often a particular word pair occurs together.
Each value in the co-occurrence matrix represents a pair of words occurring
together.
Example (window size = 1 or 2):

I love NLP
I love python

▪ Create a matrix X with rows (i) and columns (j) over the unique words, where Xij = Xji (the matrix is symmetric).
▪ Xij counts how many times word i occurred together with word j.
▪ The co-occurrence probability is Pij = P(Wi | Wj) = Xij / Xi.
▪ Example: P(I | love) = 2/2 = 1.
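A minimal sketch that builds the symmetric co-occurrence counts X for the two sentences above with a window size of 1 (GloVe then fits word vectors to these counts, which is not shown here):

from collections import defaultdict

corpus = [["I", "love", "NLP"], ["I", "love", "python"]]
window = 1
X = defaultdict(int)

for sent in corpus:
    for i, w in enumerate(sent):
        # neighbours within the window, on both sides of position i
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[(w, sent[j])] += 1

print(X[("I", "love")])                                    # 2
count_love = sum(w == "love" for sent in corpus for w in sent)
print(X[("I", "love")] / count_love)                       # 1.0, matching P(I | love) = 2/2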
6. FastText
▪ FastText is a word embedding technique that provides embeddings for character n-grams.

▪ It is the extension of the word2vec model.

▪ word2vec and GloVe provide distinct vector representations for the words in the vocabulary.

▪ This means the internal structure of words in the language is ignored.

▪ FastText provides embeddings for character n-grams, representing words as the average of these
embeddings .

▪ Word2Vec model provides embedding to the words, whereas fastText provides embeddings to the
character n-grams. Like the word2vec model, fastText uses CBOW and Skip-gram to compute the
vectors.

▪ FastText can also handle out-of-vocabulary words, i.e., the fast text can find the word embeddings
that are not present at the time of training.

▪ Out-of-vocabulary (OOV) words are words that do not occur while training the data and are not
present in the model’s vocabulary.
▪ In FastText, each word is represented as the average of the vector representation of
its character n-grams along with the word itself.

▪ Consider the word “equal” and n = 3, then the word will be represented by character
n-grams:

▪ < eq, equ, qua, ual, al > and < equal >

▪ the word embedding for the word ‘equal’ can be given as the sum of all vector
representations of all of its character n-gram and the word itself.

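A minimal, hedged gensim sketch of FastText's character n-gram embeddings and OOV handling (parameter names follow gensim's FastText API; the corpus and values are illustrative assumptions):

from gensim.models import FastText

sentences = [["i", "want", "to", "learn", "fasttext"],
             ["fasttext", "uses", "character", "ngrams"]]

# min_n and max_n set the character n-gram lengths used to build word vectors
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=50)

print(model.wv["fasttext"].shape)              # (50,)
print("fasttexts" in model.wv.key_to_index)    # False: never seen during training
print(model.wv["fasttexts"].shape)             # (50,) - composed from shared character n-grams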
FastText - CBOW

▪ Ex: I want to learn FastText.


▪ Input: “I,” “want,”
“to,” and “FastText”.
▪ Output: the model
predicts “learn” as output.
▪ All the input and output data are
in the same dimension and have
one-hot encoding. It uses a
neural network for training.
Dr D Paul Joseph 68
FastText - Skipgram

▪ Ex: I want to learn FastText.

▪ Input: “lea”, “arn”, “learn”

▪ Output: the model predicts "I want to fast text" as output.
Word2vec vs Fasttext
▪ Word2Vec works on the word level, while fastText works on the character n-grams.

▪ Word2Vec cannot provide embeddings for out-of-vocabulary words, while fastText


can provide embeddings for OOV words.

▪ FastText can provide better embeddings for morphologically rich languages


compared to word2vec.

▪ FastText uses the hierarchical classifier to train the model; hence it is faster than
word2vec.
Thank you

Dr D Paul Joseph,
Asst Prof, Sr Gr-I,
Department of Network and Security,
School of Computer Science and Engineering,
VIT-Amaravathi
[email protected]
