
Habib Mrad – Introduction to NLP

NLP Part from week 7

NLP = Natural Language Processing: dealing with textual data

Field of study focused on making sense of language

Uses computers to work with natural language data

Automated way to process data & organize information

Challenges of Natural Language

1. Lexical Ambiguity – Words have multiple meanings (e.g., bass: a fish vs. bass: a musical instrument)

2. Syntactic Ambiguity – A sentence has multiple possible parse trees, resulting in different
meanings

The professor said on Monday he would give an exam.

The government asks us to save soap and waste paper.

3. Semantic Ambiguity – A sentence has multiple possible meanings

Call me a cab (order me a taxi).

You are a cab (the other reading: label me as "a cab").

4. Anaphoric Ambiguity – A word or phrase (such as a pronoun) refers back to something
previously mentioned, but it is unclear which earlier mention it refers to.

Parsing is the process of breaking down a sentence into its elements so that the sentence can be
understood. Traditional parsing is done by hand, sometimes using sentence diagrams. Parsing is also
involved in more complex forms of analysis such as discourse analysis and psycholinguistics.


Example differentiating semantic vs syntactic ambiguity:

Semantic ambiguity: "I saw her duck" has 2 meanings, depending on the meaning of the word "duck".

If "duck" is the verb, the sentence means the girl lowered her head.

If "duck" refers to the animal, it means I saw her pet duck.

So the meaning of the whole sentence depends on the meaning of the word "duck".

Syntactic ambiguity: The chicken is ready to eat

Applications of NLP
 Part-Of-Speech (POS) tagging: a way to help the computer understand text by assigning a tag to each
word in the sentence: verb, noun, adjective, etc.

Break down the sentence into words and assign a tag to each word

POS tagging can help resolve syntactic ambiguity.
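As a quick illustration, a minimal sketch of POS tagging with NLTK's pos_tag (assuming the 'punkt' and 'averaged_perceptron_tagger' resources can be downloaded; the tags shown in the comment are indicative):

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model

sentence = "The professor said on Monday he would give an exam."
print(pos_tag(word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('professor', 'NN'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ...]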


 Another application of NLP is Named Entity Recognition (NER): extracting specific predefined
entities from text, such as names, dates, places, organizations, etc.

We can train a custom NER model to extract custom entities
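A minimal NER sketch using NLTK's built-in chunker (an assumption on my part, since the notes do not show NER code; the 'maxent_ne_chunker' and 'words' resources must be downloaded):

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # pre-trained named entity chunker
nltk.download('words')              # word list used by the chunker

sentence = "Albert Einstein was a professor at the University of Berlin."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))  # labels spans such as PERSON, ORGANIZATION, GPE
print(tree)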

Sentiment analysis: extracting or assigning emotions or sentiments to a sentence. Applied to tweets and
customer reviews for movies or other products (happy or angry about the product).

Text translation


Visual Question Answering: combines CV and NLP. Train an AI model to look at an image and answer a
question that we ask. The model has to understand the content of the image and then understand the
question before it can answer.

Image captioning: combines CV and NLP. You show the model an image, and the model has to describe
the image with words, i.e. generate a sentence that describes the image.


Automatic Handwriting Generation: train a model to generate handwritten text that looks realistic


Data Preparation:
1. Text cleaning
 Remove punctuation and special characters (@!.,><-+#$%^), as these don't hold any extra
information
 Remove numbers and emojis, or replace them with something more meaningful
 Remove leading and trailing whitespace: it takes space and doesn't hold any meaningful
information
 Remove stopwords (of, at, by, for, with, etc.): they don't hold information and are frequently
repeated
 Normalize case (Apple vs apple): transform everything to lowercase
 Remove HTML/XML tags: needed if the data was collected by web scraping
 Replace accented characters (such as é)
 Correct spelling errors: multiple similar spellings of the same word can confuse the model
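A few of these steps (HTML/XML tags, accented characters, whitespace) are not covered by the Colab code later in these notes; a minimal sketch using only the standard library (the sample string is made up for illustration):

import re
import unicodedata

raw = "  <p>Café   was GREAT!!</p>  "

text = re.sub(r'<[^>]+>', ' ', raw)                    # remove HTML/XML tags
text = unicodedata.normalize('NFKD', text)             # split accented characters into base letter + accent
text = text.encode('ascii', 'ignore').decode('ascii')  # drop the accent marks (é -> e)
text = text.lower()                                    # normalize case
text = re.sub(r'\s+', ' ', text).strip()               # collapse repeated whitespace and trim leading/trailing spaces
print(text)  # "cafe was great!!"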

2. Tokenization
 Turning a string or document into tokens (smaller chunks of text).
 Important Step in preparing text for NLP
 Different approaches:
 Split by Whitespace
 Split by word
 Custom regex (regular expression)

3. Stemming & Lemmatization


 Shorten words to their root stems. The main point is to reduce complexity and the
number of unique words that the model has to learn, while maintaining the overall meaning or
information.
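Lemmatization is not demonstrated in the Colab below (which sticks to stemming); a minimal sketch with NLTK's WordNetLemmatizer, assuming the 'wordnet' resource is downloaded:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # dictionary the lemmatizer relies on

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('theories'))         # 'theory'  (treated as a noun by default)
print(lemmatizer.lemmatize('turned', pos='v'))  # 'turn'    (the part of speech must be given for verbs)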


Natural Language Toolkit (NLTK):


 A Python library written for working with and modeling natural language (NL) text.
 It provides good tools and functions for loading and cleaning text
 Usually used to get data ready for Machine Learning and Deep Learning algorithms.


Text Cleaning
This Colab will show you how to clean your data in preparation for NLP. In the
first section, we will use built-in Python functions; in the second, we will
introduce the NLTK library.

Split by white space


Splitting a document or text into words. Calling split() with no input parameters splits
the text on whitespace only. "Who's", for example, is treated as a single word.

text = 'Albert Einstein is widely celebrated as one of the most brilliant scientists '\
'who’s ever lived. His fame was due to his original and creative theories that at first '\
'seemed crazy, but that later turned out to represent the actual physical world. '\
'Nonetheless, when he applied his theory of general relativity to the universe as a whole '\
'in a paper published in 1917, while serving as Director of the Kaiser Wilhelm Institute '\
'for Physics and professor at the University of Berlin, Einstein suggested the notion of '\
'a "cosmological constant". He discarded this notion when it had been established that '\
'the universe was indeed expanding. His contributions to physics made it possible to '\
'envision how the universe evolved.'\
'In order to understand Einstein’s contribution to cosmology it is helpful to begin with '\
'his theory of gravity. Rather than thinking of gravity as an attractive force between '\
'two objects, in the tradition of Isaac Newton, Einstein’s conception was that gravity is '\
'a property of massive objects that “bends” space and time around itself. For example, '\
'consider the question of why the Moon does not fly off into space, rather than staying '\
'in orbit around Earth. Newton would say that gravity is a force acting between the Earth '\
'and Moon, holding it in orbit. Einstein would say that the massive Earth “bends” space '\
'and time around itself, so that the moon follows the curves created by the massive '\
'Earth. His theory was confirmed when he predicted that even starlight would bend when '\
'passing near the sun during a solar eclipse.'\
'In 1917 Einstein published a paper in which he applied this theory to all matter in '\
'space. His theory led to the conclusion that all the mass in the universe would bend '\
'space so much that it should have long ago contracted into a single dense blob. Given '\
'that the universe seems pretty well spread out, however, and does not seem to be '\
'contracting, Einstein decided to add a “fudge factor,” that acts like “anti-gravity” and '\
'prevents the universe from collapsing. He called this idea, which was represented as an '\
'additional term in the mathematical equation representing his theory of gravity, the '\
'cosmological constant. In other words, Einstein supposed the universe to be static and '\
'unchanging, because that is the way it looked to astronomers in 1917.'

# split into words by white space
words = text.split()  # automatically split the text on whitespace
# words is an array that contains all the different words separated by whitespace

print(words[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who’s', 'ever', 'lived.',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy,', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world.', 'Nonetheless,', 'when', 'he', 'applied', 'his',
'theory', 'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', '1917,', 'while',
'serving', 'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University',
'of', 'Berlin,', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', '"cosmological', 'constant".', 'He', 'discarded', 'this', 'notion',
'when', 'it', 'had', 'been', 'established', 'that', 'the', 'universe']

Split by word
Using regular expression re and splitting based on words. Notice the difference in
"who's".

import re  # Python module for regular expressions

Regular expressions allow us to define a certain pattern and then extract or replace text that fits this pattern.

words = text.split()  # automatically split text by whitespace

# split based on words only
words = re.split(r'\W+', text)  # \W+ matches runs of non-word characters, so the text is split into words
# text = the text we want to split

print(words[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', 'Nonetheless', 'when', 'he', 'applied', 'his', 'theory',
'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', '1917', 'while', 'serving',
'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University', 'of',
'Berlin', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', 'cosmological', 'constant', 'He', 'discarded', 'this', 'notion', 'when',
'it', 'had', 'been', 'established', 'that', 'the']


Normalizing case
Normalizing is when we turn all the words of the document into lower case. Be careful,
however: this method should not always be applied, because it might change the meaning.
For example, take the French telecom company Orange and the fruit orange. Normalizing
would conflate the two and change the meaning.

# Normalizing means converting all the characters to lowercase

# split into words by white space
words = text.split()  # separate the whole text into chunks of words

# convert to lower case
words = [word.lower() for word in words]  # loop over the words using a list comprehension
print(words[:100])

NLTK
The Natural Language Toolkit is a suite of libraries and programs for symbolic
and statistical natural language processing for English, written in the Python
programming language. (Wikipedia)

Split into sentences


This tokenizer divides a text into a list of sentences by using an unsupervised
algorithm to build a model for abbreviation words, collocations, and words that start
sentences. It must be trained on a large collection of plaintext in the target language
before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English. (nltk.org)

import nltk
from nltk import sent_tokenize  # function responsible for tokenizing (splitting) our text into sentences
nltk.download('punkt')  # the Punkt tokenizer is an algorithm responsible for tokenizing the text into sentences

# split into sentences
sentences = sent_tokenize(text)


for sentence in sentences:
    print(sentence)

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.
Albert Einstein is widely celebrated as one of the most brilliant scientists who’s ever lived.
His fame was due to his original and creative theories that at first seemed crazy, but that later turned out to represent
the actual physical world.
Nonetheless, when he applied his theory of general relativity to the universe as a whole in a paper published in 1917,
while serving as Director of the Kaiser Wilhelm Institute for Physics and professor at the University of Berlin,
Einstein suggested the notion of a "cosmological constant".
He discarded this notion when it had been established that the universe was indeed expanding.
His contributions to physics made it possible to envision how the universe evolved.In order to understand Einstein’s
contribution to cosmology it is helpful to begin with his theory of gravity.
Rather than thinking of gravity as an attractive force between two objects, in the tradition of Isaac Newton,
Einstein’s conception was that gravity is a property of massive objects that “bends” space and time around itself.
For example, consider the question of why the Moon does not fly off into space, rather than staying in orbit around
Earth.
Newton would say that gravity is a force acting between the Earth and Moon, holding it in orbit.
Einstein would say that the massive Earth “bends” space and time around itself, so that the moon follows the curves
created by the massive Earth.
His theory was confirmed when he predicted that even starlight would bend when passing near the sun during a solar
eclipse.In 1917 Einstein published a paper in which he applied this theory to all matter in space.
His theory led to the conclusion that all the mass in the universe would bend space so much that it should have long
ago contracted into a single dense blob.
Given that the universe seems pretty well spread out, however, and does not seem to be contracting, Einstein
decided to add a “fudge factor,” that acts like “anti-gravity” and prevents the universe from collapsing.
He called this idea, which was represented as an additional term in the mathematical equation representing his
theory of gravity, the cosmological constant.
In other words, Einstein supposed the universe to be static and unchanging, because that is the way it looked to
astronomers in 1917.

Split into words


From the same toolkit, we consider the tokenize library and import the word tokenizer.
Similarly to re.split, this function will split the text into tokens rather than words.
Make sure you check out the output and spot the differences!

from nltk.tokenize import word_tokenize


# split into words
tokens = word_tokenize(text)
print(tokens[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', '’', 's', 'ever',
'lived', '.', 'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', ',',
'but', 'that', 'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', '.', 'Nonetheless', ',', 'when', 'he',
'applied', 'his', 'theory', 'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in',
'1917', ',', 'while', 'serving', 'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at',
'the', 'University', 'of', 'Berlin', ',', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', '``', 'cosmological', 'constant', "''", '.',
'He']


Filter out punctuation


Python strings include the built-in method isalpha(), which can be used to determine
whether the scanned word is alphabetic or something else (numeric, punctuation, special
characters, etc.).

# split into words
tokens = word_tokenize(text)

# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
# isalpha() returns True if all the characters in a word are alphabetic, otherwise it returns False
print(words[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', 'Nonetheless', 'when', 'he', 'applied', 'his', 'theory',
'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', 'while', 'serving', 'as',
'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University', 'of', 'Berlin',
'Einstein', 'suggested', 'the', 'notion', 'of', 'a', 'cosmological', 'constant', 'He', 'discarded', 'this', 'notion', 'when', 'it', 'had',
'been', 'established', 'that', 'the', 'universe']

Remove stopwords
Stopwords are words which do not add much meaning to a sentence. They can
safely be ignored without sacrificing the meaning of the sentence. The most common
are short function words such as the, is, at, which, on, etc.
In some cases, removing stopwords can cause problems when searching for phrases that
include them, particularly in names such as “The Who” or “Take That”.
Treating the word "not" as a stopword also changes the entire meaning if it is removed (try
"this code is not good").

# let's list all the stopwords for NLTK

import nltk
from nltk.corpus import stopwords  # importing the list of all the stopwords for NLTK

nltk.download('stopwords')  # download the stopwords for different languages


stop_words = stopwords.words('english')  # extract only the stopwords for the English language
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was',
'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan',
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.

As you can see, the stopwords are all lower case and don't have punctuation. If
we're going to compare them with our tokens, we need to make sure that our text is prepared
the same way. (Our text must have the same format as the stopwords, so we have to clean
the text before we can compare them.)

This cell recaps everything we have previously learned in this Colab: tokenizing, lower
casing and checking for alphabetic words.

# clean our text

# split into words


tokens = word_tokenize(text)

# convert to lower case


tokens = [w.lower() for w in tokens]

# remove all tokens that are not alphabetic


words = [word for word in tokens if word.isalpha()]

# filter out stop words


stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[:100])

['albert', 'einstein', 'widely', 'celebrated', 'one', 'brilliant', 'scientists', 'ever', 'lived', 'fame', 'due', 'original', 'creative',
'theories', 'first', 'seemed', 'crazy', 'later', 'turned', 'represent', 'actual', 'physical', 'world', 'nonetheless', 'applied',
'theory', 'general', 'relativity', 'universe', 'whole', 'paper', 'published', 'serving', 'director', 'kaiser', 'wilhelm', 'institute',


'physics', 'professor', 'university', 'berlin', 'einstein', 'suggested', 'notion', 'cosmological', 'constant', 'discarded',
'notion', 'established', 'universe', 'indeed', 'expanding', 'contributions', 'physics', 'made', 'possible', 'envision',
'universe', 'order', 'understand', 'einstein', 'contribution', 'cosmology', 'helpful', 'begin', 'theory', 'gravity', 'rather',
'thinking', 'gravity', 'attractive', 'force', 'two', 'objects', 'tradition', 'isaac', 'newton', 'einstein', 'conception', 'gravity',
'property', 'massive', 'objects', 'bends', 'space', 'time', 'around', 'example', 'consider', 'question', 'moon', 'fly', 'space',
'rather', 'staying', 'orbit', 'around', 'earth', 'newton', 'would']

Stem words
Stemming refers to the process of reducing each word to its root or base form.
Two common suffix-stripping stemmers are Porter and Lancaster; each has its own
algorithm and they sometimes produce different outputs.

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer  # algorithm responsible for stemming the words

# split into words
tokens = word_tokenize(text)

# stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

['albert', 'einstein', 'is', 'wide', 'celebr', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientist', 'who', '’', 's', 'ever', 'live', '.', 'hi',
'fame', 'wa', 'due', 'to', 'hi', 'origin', 'and', 'creativ', 'theori', 'that', 'at', 'first', 'seem', 'crazi', ',', 'but', 'that', 'later', 'turn',
'out', 'to', 'repres', 'the', 'actual', 'physic', 'world', '.', 'nonetheless', ',', 'when', 'he', 'appli', 'hi', 'theori', 'of', 'gener', 'rel',
'to', 'the', 'univers', 'as', 'a', 'whole', 'in', 'a', 'paper', 'publish', 'in', '1917', ',', 'while', 'serv', 'as', 'director', 'of', 'the',
'kaiser', 'wilhelm', 'institut', 'for', 'physic', 'and', 'professor', 'at', 'the', 'univers', 'of', 'berlin', ',', 'einstein', 'suggest', 'the',
'notion', 'of', 'a', '``', 'cosmolog', 'constant', "''", '.', 'he']

from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer  # another stemmer algorithm we can use (similar functionality to the previous one)

# split into words
tokens = word_tokenize(text)

# stemming of words
lancaster = LancasterStemmer()
stemmed = [lancaster.stem(word) for word in tokens]
print(stemmed[:100])


['albert', 'einstein', 'is', 'wid', 'celebr', 'as', 'on', 'of', 'the', 'most', 'bril', 'sci', 'who', '’', 's', 'ev', 'liv', '.', 'his', 'fam', 'was',
'due', 'to', 'his', 'origin', 'and', 'cre', 'the', 'that', 'at', 'first', 'seem', 'crazy', ',', 'but', 'that', 'lat', 'turn', 'out', 'to', 'repres',
'the', 'act', 'phys', 'world', '.', 'nonetheless', ',', 'when', 'he', 'apply', 'his', 'the', 'of', 'gen', 'rel', 'to', 'the', 'univers', 'as', 'a',
'whol', 'in', 'a', 'pap', 'publ', 'in', '1917', ',', 'whil', 'serv', 'as', 'direct', 'of', 'the', 'kais', 'wilhelm', 'institut', 'for', 'phys',
'and', 'profess', 'at', 'the', 'univers', 'of', 'berlin', ',', 'einstein', 'suggest', 'the', 'not', 'of', 'a', '``', 'cosmolog', 'const', "''", '.',
'he']

Feature Extraction using Bag of Words

ML only deals with numbers, so we need to convert text into numbers.


Bag of Words is an approach that does exactly that

Bags of words model:


 Need to convert text to numbers
 Generate a fixed-length vector of numbers that represents the input sentence
 Collect the unique words and assign a unique number to each word
 We are only concerned with which words are present, not the order in which they
are present
 This method is called BoW because any information about the order or structure of the
words in the document is discarded. The model is only concerned with whether known
words occur in the documents, not where in the document they occur


For example:
We have 3 sentences in our text corpus.

 The first step in BoW is to collect each unique word and give it a certain ID or number:
this is called the word_index
 Next, we create a vector of size 5 because our word_index is made of 5 words, and
place a 1 wherever the word exists in the sentence and a 0 if it doesn't exist

Instead of placing a 1 whenever the word exists, there are other methods for scoring words. For
example, we can use the count (how many times the word occurred) or the frequency (how
frequently this word appears in the text corpus). A minimal sketch of the basic idea is shown below.
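A minimal sketch of the word_index and binary-vector idea in plain Python (the three sentences are made up for illustration; the figure from the original notes is not reproduced here):

# toy corpus of 3 sentences
corpus = ["the food was good",
          "the food was bad",
          "good food"]

# step 1: collect each unique word and give it an ID - the word_index
word_index = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index)
print(word_index)  # {'the': 0, 'food': 1, 'was': 2, 'good': 3, 'bad': 4}

# step 2: one fixed-length binary vector per sentence (1 if the word is present, 0 otherwise)
for sentence in corpus:
    vector = [0] * len(word_index)
    for word in sentence.split():
        vector[word_index[word]] = 1
    print(sentence, '->', vector)
# "the food was good" and "the food was bad" differ only in the last two positions;
# replacing the 1s with counts or frequencies gives the alternative scoring schemes mentioned above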


Word Counts:
A simple way to tokenize text and build vocabulary of known words (sometimes called the
word_index).

 Create an instance of the CountVectorizer class.


 Call the fit() function to learn a vocabulary from documents.
 Call the transform() function on documents to encode each as a numerical vector.

Word Frequencies with TF-IDF


Generate scores to highlight words that are more interesting: frequent in a document but not
across all documents.

 Term Frequency (TF): how often a given word appears within a document.
 Inverse Document Frequency (IDF): how rare the word is across documents.

Word Frequencies with TF-IDF Example
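As a worked illustration, a sketch that mirrors scikit-learn's default smoothed IDF (the same formula the TfidfVectorizer code below relies on), using the same toy three-document corpus, lower-cased and tokenized:

import math

docs = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
        ["the", "dog"],
        ["the", "fox"]]
n = len(docs)  # number of documents

def idf(word):
    df = sum(1 for doc in docs if word in doc)  # number of documents containing the word
    return math.log((1 + n) / (1 + df)) + 1     # smoothed IDF, as scikit-learn computes it by default

def tf(word, doc):
    return doc.count(word)                      # raw term frequency inside one document

print(idf("the"))    # 1.0   -> appears in every document, so it carries little information
print(idf("brown"))  # ~1.69 -> appears in only one document, so it gets a higher weight
print(tf("the", docs[0]) * idf("the"))      # un-normalized TF-IDF of "the" in the first document
print(tf("brown", docs[0]) * idf("brown"))  # un-normalized TF-IDF of "brown" in the first document

scikit-learn additionally L2-normalizes each document's vector, which is why the values printed by TfidfVectorizer later in these notes fall between 0 and 1.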


Using scikit-learn for feature extraction. It tokenizes the text and builds a vocabulary of words
using the TF-IDF approach.

 Create an instance of the TfidfVectorizer class


 Call the fit() function to learn a vocabulary from documents (text corpus).
 Call the transform() function on documents to encode each as a numerical vector.


BoW is a useful feature extraction method, but has its limitations

Limitations of BoW
 Vocabulary: it requires generating a large vocabulary of all the words in the corpus.
Applying different text cleaning steps will result in different words and therefore a
different vocabulary. The vocab requires careful design, specifically to manage the size,
which impacts the sparsity of the document representations.

 Sparsity: having a large vocabulary of words will result in a large, sparse
vector representation that is mostly filled with zeros and a few ones. Sparse
representations are harder to model, both for computational reasons and for
information reasons: the challenge is for the models to harness so little
information in such a large representational space. (It is harder to train the model, so it is
also harder to achieve good results.)

 No order in the words:


The food was good, not bad at all
The food was bad, not good at all

The two sentences are opposite in meaning, but their BoW representations result
in the same vector, since BoW doesn't care about the order of the words.

 Meaning: Discarding word order ignores the context of the sentence and in turn loses
the meaning of the words in the document (semantics). Context and meaning can offer a lot
to the model; if this information were modeled, it could tell the difference between the
same words arranged differently.

Word embeddings are an ALTERNATIVE feature extraction method.


Bag Of Words (BOW)

This notebook shows the most basic applications of the different functions used
to create bag of words for texts

Word Counts
In this cell, we can see how applying the fit() and transform() functions enables
us to create a vocabulary of 8 words for the document. The vectorizer's fit builds
the vocab, and transform encodes the document as the number of appearances of
each word. The indexing is done alphabetically.

from sklearn.feature_extraction.text import CountVectorizer  # to look at the word counts using CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)  # the vectorizer goes over the text and builds the word vocab, which means it looks at all the unique words
# and then gives each word a certain ID or index number

# summarize
print(vectorizer.vocabulary_)  # print the vocabulary of this vectorizer

# encode document
vector = vectorizer.transform(text)  # transform the text: convert the words into numbers based on how they appear in the sentence.
# we get the vector, which is the encoded document of the text

# summarize encoded vector
print(vector.shape)      # print the shape of the vector
print(vector.toarray())  # print the values of the vector

# {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}: this is the Vocabulary of the Corpus,
# i.e. the list of unique words in our corpus, where each word has a certain ID.
# These IDs were given based on the alphabetical order of the words

# (1, 8): size of the vector, with 1 row of size 8 (because we have 8 unique words in our vocabulary)

# [[1 1 1 1 1 1 1 2]]: values of the vector. Take the first item '1' at index 0, in this case 'brown'.
# How many times do we have 'brown' in our text? We have the word 'brown' 1 time, so we put 1

# This is how we convert our text into numbers: by counting the number of times each word appears in the text

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
[[1 1 1 1 1 1 1 2]]

Let's test on another example


In this example, we used the same vocabulary we built in the previous cell held in
vectorizer but on a different text. Obviously the resulting count is different since the
word 'brown' for example does not appear in this document. It is given the value 0,
whereas the word 'quick' appears 3 times.

# in this example we are NOT going to fit or learn a new vocab; we are just going to transform this text using the already fitted vectorizer

text = ["The quick quick quick fox jumped over a big dog"]

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)      # (1, 8): still 1 row (one vector) of size 8 because the vocab has not changed; we are still using a vocab of size 8
print(vector.toarray())  # [[0 1 1 1 0 1 3 1]]: the values of the vector have now changed

# this is how we convert text into numbers using the CountVectorizer class; any word that doesn't exist in our vocab is simply not counted

(1, 8)
[[0 1 1 1 0 1 3 1]]


Word frequencies
This section deals with the application of the TF-IDF formula in order to describe word
appearances relative to multiple documents

Source: https://medium.com/nlpgurukool/tfidf-vectorizer-5421f1528402
This is another approach to vectorize or convert our text into numbers

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents: this doesn't work on only ONE document; we need at least 2 or 3 documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)  # fit is used here to build the vocab

# summarize
print(vectorizer.vocabulary_)  # print the vocab
print(vectorizer.idf_)         # print the IDF value of each word

# encode document
vector = vectorizer.transform([text[0]])  # transform the original text into its TF-IDF representation using the transform function

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

# {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}: the vocab

# [1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.]: the number associated with every word,
# calculated using the TF-IDF formula.
# 'brown': 0 is the first word in our index and has IDF = 1.69314718, 'dog': 1 has IDF = 1.28768207, and so on and so forth

# After transforming our first document ("The quick brown fox jumped over the lazy dog.") with vectorizer.transform([text[0]]),
# we get a vector of size 8:
# [[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646 0.36388646 0.42983441]]
# and these are the values of the vector after transformation

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646 0.36388646 0.42983441]]

You can tell how, for example, the word 'brown' has a higher multiplier than 'the', even
though the latter appears twice in the first document. This is because 'the' also
appears in the other two documents, which reduces its uniqueness.

Let's test it on another example

text = ["The quick quick quick fox jumped over a big dog"]

# encode document
vector = vectorizer.transform(text)  # vectorize this text using the fitted vectorizer

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

# The first item, 'brown', has a value of 0 because this text does not contain the word 'brown'
# The word 'quick', which has index 6, has the highest value here (0.84833724) because it is present
# more than once, so its term frequency is very high

We find ourselves with two null values because this document does not include the
words 'brown' or 'lazy'. And since the word 'big' is not included in the vocab, it is discarded.


sentiment_analysis.ipynb

Text classification
Text classification is one of the important tasks of text mining

In this notebook, we will perform Sentiment Analysis on IMDB movie reviews.

Sentiment Analysis is the art of extracting people's opinions from digital text. We will use
a logistic regression model from Scikit-Learn to predict the sentiment given a movie
review.
We will use the IMDB movie review dataset, which consists of 50,000 movie reviews
(50% are positive, 50% are negative).

The libraries needed in this exercise are:

 Numpy — a package for scientific computing.


 Pandas — a library providing high-performance, easy-to-use data structures and
data analysis tools for Python
 Matplotlib — a package for plotting & visualizations.
 scikit-learn — a tool for data mining and data analysis.
 NLTK — a platform to work with natural language.


Loading the data

Importing the libraries and necessary dictionaries


import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
from tensorflow import keras

# download Punkt Sentence Tokenizer


nltk.download('punkt')
# download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True

Loading the dataset in our directory


# download IMDB dataset
!wget "https://raw.githubusercontent.com/javaidnabi31/Word-Embeddding-
Sentiment-Classification/master/movie_data.csv" -O "movie_data.csv"
# wget is a command to parse from URL and save the file on "movie_data.csv
"
# -
o is used to specify the output name of the file after downloading the dat
aset

# list files in current directory


!ls -lah

--2022-09-16 10:20:36--
https://raw.githubusercontent.com/javaidnabi31/Word-Embeddding-Sentiment-Classification/master/movie_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)...
185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com
(raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65862309 (63M) [text/plain]
Saving to: ‘movie_data.csv’


movie_data.csv 100%[===================>] 62.81M 316MB/s in


0.2s

2022-09-16 10:20:38 (316 MB/s) - ‘movie_data.csv’ saved


[65862309/65862309]

total 63M
drwxr-xr-x 1 root root 4.0K Sep 16 08:30 .
drwxr-xr-x 1 root root 4.0K Sep 16 08:27 ..
drwxr-xr-x 4 root root 4.0K Sep 14 13:43 .config
-rw-r--r-- 1 root root 63M Sep 16 10:20 movie_data.csv
drwxr-xr-x 1 root root 4.0K Sep 14 13:44 sample_data

Reading the dataset file and getting info on it


Using pandas to read the csv file and displaying the first 5 rows

# path to IMDB dataseet


data = "movie_data.csv" # path that points to the dataset

# read file (dataset) into our program using pandas


df = pd.read_csv(data)

# display first 5 rows


df.head()

Getting info on our dataset


df.info()
# Dtype = object which means a string


# Dtype = int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review 50000 non-null object
1 sentiment 50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.4+ KB

A balanced dataset in sentiment analysis is a dataset which holds an equal amount of positive
sentiment data and negative sentiment data, meaning 50% of the data is positive and 50% is
negative

# check if the dataset is balanced (number of positive sentiments = number of negative sentiments)
# by plotting the different classes

df.sentiment.value_counts()

1 25000
0 25000
Name: sentiment, dtype: int64

df.sentiment.value_counts().plot(kind='bar')


Text cleaning
print(df.review[10]) # print the review at index 10

I loved this movie from beginning to end.I am a musician and i let drugs get in the way of my some of the things i
used to love(skateboarding,drawing) but my friends were always there for me.Music was like my rehab,life
support,and my drug.It changed my life.I can totally relate to this movie and i wish there was more i could say.This
movie left me speechless to be honest.I just saw it on the Ifc channel.I usually hate having satellite but this was a
perk of having satellite.The ifc channel shows some really great movies and without it I never would have found this
movie.Im not a big fan of the international films because i find that a lot of the don't do a very good job on
translating lines.I mean the obvious language barrier leaves you to just believe thats what they are saying but its not
that big of a deal i guess.I almost never got to see this AMAZING movie.Good thing i stayed up for it instead of
going to bed..well earlier than usual.lol.I hope you all enjoy the hell of this movie and Love this movie just as much
as i did.I wish i could type this all in caps but its again the rules i guess thats shouting but it would really show my
excitement for the film.I Give It Three Thumbs Way Up!<br /><br />This Movie Blew ME AWAY!

Let's define a function (clean_review) that would clean each movie review (sentence)

import re  # regular expressions, used to remove the non-alphabetic characters
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

english_stopwords = stopwords.words('english')  # get the English stopwords

stemmer = PorterStemmer()  # create an instance of the PorterStemmer class

# define cleaning function
def clean_review(text):
    # convert text to lower case
    text = text.lower()  # text is a single review

    # remove non-alphabetic characters
    # sub = substitute: any character that matches the pattern [^a-z]
    # (i.e. anything that is not a lowercase letter) is replaced by a space character
    text = re.sub(r'[^a-z]', ' ', text)

    # tokenize the sentence (before stemming the text, we convert the sentence into an array of words)
    words = word_tokenize(text)

    # stem words using the Porter Stemmer algorithm
    stemmed = [stemmer.stem(word) for word in words]
    # stemmed is an array containing the stemmed version of the words array

    # reconstruct the text: rebuild the array of stemmed words into a full sentence
    text = ' '.join(stemmed)

    # remove stopwords:
    # loop over each word in the text, check whether it is a stopword, and keep it only if it is NOT,
    # then recombine the remaining words into the full text
    text = ' '.join([word for word in text.split() if word not in english_stopwords])
    # calling text.split() is safe here because the stemmed words were joined with a single space character,
    # so splitting on whitespace recovers exactly the same words, nothing extra and nothing missing

    # return the full cleaned text (a string, not an array)
    return text

And try it out on an instance of the dataset

print(df['review'][1])
print(clean_review(df['review'][1]))

# compare the ouput word by word

Actor turned director Bill Paxton follows up his promising debut, the
Gothic-horror "Frailty", with this family friendly sports drama about the
1913 U.S. Open where a young American caddy rises from his humble
background to play against his Bristish idol in what was dubbed as "The
Greatest Game Ever Played." I'm no fan of golf, and these scrappy underdog
sports flicks are a dime a dozen (most recently done to grand effect with
"Miracle" and "Cinderella Man"), but some how this film was enthralling
all the same.<br /><br />The film starts with some creative opening
credits (imagine a Disneyfied version of the animated opening credits of
HBO's "Carnivale" and "Rome"), but lumbers along slowly for its first by-
the-numbers hour. Once the action moves to the U.S. Open things pick up
very well. Paxton does a nice job and shows a knack for effective
directorial flourishes (I loved the rain-soaked montage of the action on
day two of the open) that propel the plot further or add some unexpected
psychological depth to the proceedings. There's some compelling character
development when the British Harry Vardon is haunted by images of the
aristocrats in black suits and top hats who destroyed his family cottage
as a child to make way for a golf course. He also does a good job of
visually depicting what goes on in the players' heads under pressure.
Golf, a painfully boring sport, is brought vividly alive here. Credit
should also be given the set designers and costume department for creating
an engaging period-piece atmosphere of London and Boston at the beginning
of the twentieth century.<br /><br />You know how this is going to end not
only because it's based on a true story but also because films in this
genre follow the same template over and over, but Paxton puts on a better
than average show and perhaps indicates more talent behind the camera than
he ever had in front of it. Despite the formulaic nature, this is a nice
and easy film to root for that deserves to find an audience.

actor turn director bill paxton follow hi promis debut gothic horror
frailti thi famili friendli sport drama u open young american caddi rise
hi humbl background play hi bristish idol wa dub greatest game ever play


fan golf scrappi underdog sport flick dime dozen recent done grand effect
miracl cinderella man thi film wa enthral br br film start creativ open
credit imagin disneyfi version anim open credit hbo carnival rome lumber
along slowli first number hour onc action move u open thing pick veri well
paxton doe nice job show knack effect directori flourish love rain soak
montag action day two open propel plot add unexpect psycholog depth
proceed compel charact develop british harri vardon haunt imag aristocrat
black suit top hat destroy hi famili cottag child make way golf cours also
doe good job visual depict goe player head pressur golf pain bore sport
brought vividli aliv credit also given set design costum depart creat
engag period piec atmospher london boston begin twentieth centuri br br
know thi go end onli becaus base true stori also becaus film thi genr
follow templat paxton put better averag show perhap indic talent behind
camera ever front despit formula natur thi nice easi film root deserv find
audienc

And now clean the entire dataset reviews

# apply to the whole dataset
df['clean_review'] = df['review'].apply(clean_review)  # for each item (review), apply clean_review
df.head()

Split dataset for training and testing


We will split our data into two subsets: a 50% subset will be used for training the model
for prediction and the remaining 50% will be used for evaluating or testing its
performance. The random state ensures reproducibility of the results.

from sklearn.model_selection import train_test_split


X = df['clean_review'].values


y = df['sentiment'].values

# Split data into 50% training & 50% test
# Use a random state of 42, for example, to ensure we always get the same split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# (25000,) (25000,): we have 25000 rows for the training input; each input is a single item because it is one string of words (one review)
# the second shape is 25000 rows of one integer each, which is the sentiment label

Feature extraction with Bag of Words


In this section, let's apply the Bag of Words method to learn the vocabulary of our
text and with it transform our training input data

# here we want to convert our reviews from being an array of words into an array of numbers

from sklearn.feature_extraction.text import CountVectorizer

# FILL BLANKS
# define a CountVectorizer (with binary=True and max_features=10000)
vectorizer = CountVectorizer(binary=True, max_features=10000)

# we want to use CountVectorizer not as a count vectorizer but rather as a basic bag of words,
# meaning that we don't care about the count of the words (how many times a word appeared in the sentence);
# if the word appeared at least once we just put 1, otherwise we put 0: so we set binary=True
# max_features specifies the maximum number of features because we are dealing with a huge dataset,
# so we don't care about having all the unique words in the vocab, only the top 10000 words
# so the vocab size is 10000, which means each sentence (review) will be converted into a vector of size 10000

# learn the vocabulary of all tokens in our training dataset
vectorizer.fit(x_train)


# transform x_train to bag of words
x_train_bow = vectorizer.transform(x_train)  # convert the reviews into actual vectors of numbers
x_test_bow = vectorizer.transform(x_test)

print(x_train_bow.shape, y_train.shape)
print(x_test_bow.shape, y_test.shape)

# (25000, 10000): each review is now a vector that has 10000 numbers in it

(25000, 10000) (25000,)
(25000, 10000) (25000,)

Classification
Our data is ready for classification. Let's use LogisticRegression

from sklearn.linear_model import LogisticRegression

# define the LogisticRegression classifier


model = LogisticRegression()

# train the classifier on the training data


model.fit(x_train_bow, y_train)

# get the mean accuracy on the training data


acc_train = model.score(x_train_bow, y_train)

print('Training Accuracy:', acc_train)

Training Accuracy: 0.9812


/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to
converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
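The convergence warning above is harmless for this exercise, but if desired it can be addressed by giving the lbfgs solver more iterations, e.g. (max_iter=1000 is an arbitrary choice, not part of the original notebook):

from sklearn.linear_model import LogisticRegression

# allow the solver more iterations so it can fully converge on the 10000-dimensional BoW features
model = LogisticRegression(max_iter=1000)
model.fit(x_train_bow, y_train)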


Evaluating the performance of our model through its accuracy score

# Evaluate model with test data


acc_test = model.score(x_test_bow, y_test)
print('Test Accuracy:', acc_test)

Test Accuracy: 0.86624

Let's use the model to predict!


To do so, let's create a predict function which takes as argument our model and the
bag of words vectorizer together with a review on which it would predict the
sentiment.
This review should be cleaned with the clean_review function we built, transformed
by bag of words and then used for prediction with model.predict().

# define predict function
def predict(model, vectorizer, review):
    review = clean_review(review)  # cleaning
    review_bow = vectorizer.transform([review])  # convert it from text to numbers
    return model.predict(review_bow)[0]  # return the first prediction

# we have to make sure that our input review passes through the same steps,
# i.e. it is cleaned and preprocessed the same way, before it can be fed to the model
# review = the actual review to predict on

And let's try it out on an example

review = 'The movie was great!'


predict(model, vectorizer, review)


The Word Embedding Model

There is a better way to encode text for neural networks that addresses the limitations of the BoW approach.

 A learned representation for text where words that have the same meaning have
a similar representation.
 Considered one of the breakthroughs of deep learning on NLP problems, because
it was among the first approaches to take the meaning of words into consideration.


Word Embeddings - How to use them?


 Compute similarities between words
 Create groups of related words
 Use the features in text classification
 Enrich information about lesser-known words by leveraging more frequently used
words:

- Part-of-Speech tagging
- Named-Entity recognition


Word2Vec
 One of the earliest and most famous word-embedding models, which converts
words into vectors.
 It is an unsupervised learning approach: the model is given a text corpus or a
series of documents and learns an embedding for the words in this corpus.
 A statistical method for efficiently learning a standalone word embedding from a
text corpus.
 Developed by Tomas Mikolov, et al. at Google in 2013.
 Has become the de facto standard for developing pre-trained word embeddings.
 The cool property of word2vec is that similar words have similar vectors, so we
can draw inferences about different words based on vector similarity.


How does Word2vec work?

 Uses small neural networks to calculate word embeddings based on words’ local
context (window of neighboring words). There are 2 different approaches for
word2vec models:

- Continuous Bag-of-Words (CBOW)


- Continuous Skip-Gram model

CBOW works by taking a window of 5 words, hiding the word in the middle, and then
teaching a model to predict this word based on the 4 neighboring words. The CBOW
model learns the embeddings by predicting the current word based on its context

The continuous skip-Gram model works by taking a word and learning how to predict
the four neighboring words based on that word. The continuous skip-Gram model learns
by predicting the surrounding words based on the current word.
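In Gensim (used later in these notes), the choice between the two architectures is controlled by the sg parameter; a minimal sketch, reusing the kind of tokenized dummy sentences shown in the training example below:

from gensim.models import Word2Vec

sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence']]

cbow_model = Word2Vec(sentences, min_count=1, sg=0)      # sg=0 (default): Continuous Bag-of-Words
skipgram_model = Word2Vec(sentences, min_count=1, sg=1)  # sg=1: Continuous Skip-Gram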


Global Vectors for Word Representation (GloVe)
 An extension of the Word2Vec method for learning word vectors.
 Developed by Pennington, et al. at Stanford.
 Problem? Word2Vec ignores statistical properties of the corpus (whether some
context words appear more often than others, i.e. the counts or frequencies of
word appearances).
 Word2vec builds its model purely based on where words appear in sentences. For
that reason, GloVe was developed to bring co-occurrence frequencies into the
model, since they hold vital information and should be integrated into it.
Co-occurrence means considering all the unique words of the corpus and counting,
for each pair of unique words, how often they appear next to each other, which
carries extra information about meaning.
 GloVe argues that the frequency of co-occurrences is vital information.
 GloVe builds word embeddings in such a way that a combination of word vectors
relates directly to the probability of these words' co-occurrence in the corpus.
 In a practical sense, it resembles building a table whose columns are all the unique
words and whose rows are also all the unique words, and for each pair counting
their appearances next to each other. This expands the ability of the model to
separate words from each other and their meanings.

Gensim
 Open-source Python library for natural language processing.
 It is billed as “topic modeling for humans”.
 It is not a generic NLP framework like NLTK, but rather a mature, focused, and
efficient suite of NLP tools for topic modeling.
 Topic modeling involves extracting the topic or content of a certain text. It asks
the question: what are we talking about here?
 It supports an implementation of the Word2Vec model, so it is useful when we want to
train a custom word embedding model or load a pre-trained model for inference, as
in the sketch below.
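For example, a pre-trained embedding can be fetched through Gensim's downloader API; a sketch assuming internet access ('glove-wiki-gigaword-100' is one of the models the downloader provides):

import gensim.downloader as api

# downloads the pre-trained 100-dimensional GloVe vectors (Wikipedia + Gigaword) on first use
glove = api.load('glove-wiki-gigaword-100')

print(glove.most_similar('einstein', topn=3))  # words whose vectors are closest to 'einstein'
print(glove.similarity('moon', 'earth'))       # cosine similarity between two word vectors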


Word Embeddings
Objective: The goal of this exercise is to explore the Word2Vec technique for word
embeddings and to introduce Stanford's GloVe embedding as well. The libraries we will
be using are Gensim for the Word2Vec and GloVe word embeddings, matplotlib for
visualization, and Scikit-Learn for Principal Component Analysis models, which
are used for reducing dimensionality.

Learn Word2Vec Embedding using Gensim


Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. However, we
will demonstrate the principles using a small in-memory example of text.
Each sentence must be tokenized (divided into words and prepared). The sentences
could be text loaded into memory, or an iterator that progressively loads text, required
for very large text corpora.
There are many parameters on this constructor:

 size: (default 100) The number of dimensions of the embedding, e.g. the length of the
dense vector to represent each token (word).
 window: (default 5) The maximum distance between a target word and words around the
target word.
 min_count: (default 5) The minimum count of words to consider when training the
model; words with an occurrence less than this count will be ignored.
 workers: (default 3) The number of threads to use while training.
 sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).

Building and training a Word2Vec model


from gensim.models import Word2Vec

# define training data (this is dummy training data)
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# Here the data is already split into tokens,
# so it is an array of arrays, and each inner array holds the words of one sentence

# We want to keep every word that occurs at least one time (min_count=1)

# train model
model = Word2Vec(sentences, min_count=1)

# summarize the loaded model
print(model)

# summarize vocabulary
words = list(model.wv.vocab)  # model.wv.vocab is the vocab of the model; we convert it to a list
print(words)

# access the vector for one word
print(model['sentence'])  # convert this word into its vector form

# save model
model.save('model.bin')

# size=100: each vector has a size of 100

# ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']:
# these are the unique words that we have in our vocab

Results:

WARNING:gensim.models.base_any2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[-7.8457932e-04 3.5207625e-03 -4.2921999e-03 -4.3479460e-03
-4.9049719e-03 -4.2559318e-03 -9.3232520e-04 -3.2275093e-03
5.5988797e-04 -4.5542549e-03 3.4206519e-03 1.3965422e-03
-3.0884927e-03 3.7948424e-03 1.6137204e-03 -4.8686615e-03
1.8643612e-03 -3.2971515e-03 -4.9116723e-03 -3.9235079e-03
-3.7943881e-03 -1.8185180e-03 -3.2840617e-05 4.1652853e-03
-2.0585104e-04 3.0013716e-03 -4.8512327e-03 -5.7977054e-04
3.7493408e-03 -6.9524965e-04 5.2066760e-05 4.2560236e-03
-3.0693677e-03 4.4368575e-03 4.2182300e-03 -1.3236658e-03
-3.0931591e-03 8.3119166e-04 -3.0872936e-03 -1.5930691e-03
-1.5032124e-03 1.9341486e-03 1.0838168e-03 4.0697749e-03
2.8024155e-03 1.7420420e-03 -2.8390333e-03 3.5742642e-03
-2.3106801e-04 2.6949684e-03 -3.6318353e-03 -2.8590560e-03
4.0416783e-03 1.3401215e-03 2.1875310e-03 5.5839861e-04


3.3788255e-04 4.7156317e-03 2.6413812e-03 -4.4250677e-04


4.1823206e-04 3.6568106e-03 -1.5885477e-03 1.1498865e-03
1.2212624e-03 -3.6188874e-03 1.3244674e-03 2.0759690e-03
-3.3118036e-03 4.2094686e-03 3.4420323e-03 -2.4229037e-03
-2.6206372e-03 2.2858661e-03 9.1098453e-04 8.4567037e-05
5.7552656e-04 -6.4006272e-05 -2.7839965e-03 9.6004776e-04
1.5203831e-03 3.9642006e-03 1.4925852e-03 3.4692041e-03
6.2210555e-04 1.1770288e-03 -4.4826525e-03 -7.5982476e-04
-1.2346446e-03 -3.8285055e-03 -2.2624165e-03 3.1580434e-03
-4.1306936e-03 -4.6889810e-03 -2.6903450e-03 4.7526191e-04
-3.4166775e-03 2.9652417e-03 -4.9173511e-03 -1.1885230e-03]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:25: DeprecationWarning: Call to deprecated
`__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
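The deprecation warnings above come from the older access patterns (model['word'] and model.wv.vocab). As a hedged aside not in the original notes, the Gensim 4.x equivalents look roughly like this:

# Gensim 4.x equivalents (sketch; `model` is the Word2Vec model trained above)
words = list(model.wv.key_to_index)   # the vocabulary as a list of words
vector = model.wv['sentence']         # vector lookup always goes through model.wv
print(len(words), vector.shape)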

# let's load the binary model and test it

# load model
new_model = Word2Vec.load('model.bin')
print(new_model['this', 'is'])
[[-2.6341430e-03 -6.8210840e-04 -2.2032256e-03 -2.9977076e-04
4.1053989e-03 -1.6542217e-03 -4.9963145e-04 1.7543014e-03
-1.2514311e-04 -4.7425432e-03 1.7437695e-03 2.6504786e-03
-9.4006537e-04 -1.4253150e-03 3.6435311e-03 -2.3904638e-03
-4.4562640e-03 -4.9993531e-03 -2.2515173e-03 2.3276228e-03
-3.3021998e-03 4.6024695e-03 3.0749412e-03 -4.2429226e-03
-1.5931168e-03 3.8181676e-03 2.4235758e-03 -3.4062283e-03
3.3555010e-03 1.2844945e-03 -2.7234193e-03 -8.0988108e-04
1.6813108e-03 -9.1353070e-04 4.4700159e-03 1.0303746e-03
2.9608165e-03 4.2942618e-03 -3.7393915e-03 -3.5502152e-03
-4.6064076e-03 -4.6758484e-03 4.1863527e-03 3.1615091e-03
4.5192647e-03 -9.1852510e-04 2.6027197e-03 -4.1375658e-03
-3.5489022e-03 -2.2965863e-03 7.5722620e-04 6.1148533e-04
2.1013829e-03 -3.9978596e-04 9.7986590e-04 4.9094297e-03
-2.1577110e-03 4.8989342e-03 3.8428246e-03 2.7198202e-03
-1.4638512e-03 -1.8885430e-03 -2.5638286e-04 -2.0531581e-03
-1.6875562e-03 -2.7395764e-03 5.5740983e-04 2.5572355e-03
5.6512235e-04 -2.9321280e-03 -7.5157406e-04 -1.2344553e-04
1.1710543e-03 -1.1974664e-03 -1.2816156e-03 -4.1503790e-03
4.5915483e-03 -9.5228269e-04 -3.4181602e-04 -2.3537560e-03
-2.6966461e-03 7.6452579e-04 -1.2867393e-03 1.2928311e-03
-3.9543938e-03 -4.6212925e-03 4.1194628e-03 4.6026148e-03
4.3687961e-04 -2.1079446e-03 4.3081054e-03 4.9309372e-03
-2.4515828e-03 -1.7825521e-04 2.4474061e-03 -2.4826333e-03
-4.6479427e-03 4.2938837e-03 -1.1659318e-04 -1.8699563e-03]
[-5.1402091e-04 4.0105209e-03 -6.2171964e-04 2.5609015e-03
-1.8050724e-03 4.9377461e-03 3.2770697e-03 1.8770240e-03
-4.9929242e-03 3.2758827e-03 -1.6969994e-03 -2.6876389e-03
2.9338847e-03 -4.3162694e-03 -4.2533325e-03 4.6241065e-03
-1.0224945e-03 2.1736498e-03 4.4645616e-03 -1.1554005e-03
-3.8527905e-03 -6.9526068e-05 2.0650621e-04 4.6414724e-03
1.1177687e-03 7.0009363e-04 3.9822836e-03 7.9727673e-05
-2.5550727e-04 1.5229684e-03 -3.3755524e-03 4.8815636e-03


-4.6662665e-03 4.2167231e-03 -3.0159575e-03 -3.6175633e-03


4.4732318e-05 -1.4307767e-03 8.6314336e-04 -2.6571582e-04
2.5633357e-03 -3.5801379e-04 -1.2866167e-03 2.1925850e-03
-4.2773047e-03 -4.9695778e-03 -1.3883609e-03 3.1987922e-03
1.7983645e-03 4.0425011e-03 4.5442330e-03 -1.6893109e-03
-4.9699568e-03 4.2594168e-03 -1.2876982e-03 3.6424017e-03
2.7398604e-03 4.0277229e-03 -2.7221404e-03 4.6593808e-03
2.5286388e-03 -7.1398198e-04 3.0460334e-03 3.1234655e-03
-4.6948571e-04 -4.9914969e-03 2.2993404e-03 3.1571195e-03
1.5602089e-03 -3.4938734e-03 1.7282860e-03 7.2801049e-04
-3.7645237e-03 -1.3775865e-03 5.9858215e-04 8.6858316e-05
2.0116186e-03 3.1379755e-03 -4.1431260e-05 2.1408075e-03
2.7511080e-03 -3.4430167e-03 -1.6758190e-03 -4.2654286e-04
-3.7440318e-03 2.3888228e-03 -3.1758812e-03 1.1063067e-03
-3.5542385e-03 4.4913641e-03 -2.4453797e-03 -4.1569071e-03
4.7013820e-03 4.1635367e-03 1.6596923e-03 3.6739753e-04
-4.3959259e-03 -1.1767038e-03 -4.7597084e-03 3.0412525e-03]]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:5: DeprecationWarning: Call to deprecated
`__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
"""

Visualize Word Embedding


After learning the word embeddings for the text, it is nice to explore them with visualization.
We can use classical projection methods such as PCA to reduce the high-dimensional word
vectors to two dimensions and plot them on a graph. The visualization can
provide a qualitative diagnostic for our learned model.

from gensim.models import Word2Vec


from sklearn.decomposition import PCA
from matplotlib import pyplot

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

# fit a 2D PCA model to the vectors
X = model[model.wv.vocab]  # gathers the vectors for all the words in our vocabulary
pca = PCA(n_components=2) #reduce dimensionality to 2D
result = pca.fit_transform(X) #2D model to plot


# create a scatter plot of the projection

# pull out the 2 dimensions as x and y
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)  # visualize the dots with the words

# annotate the points on the graph with the words themselves
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))  # place each word at its 2D coordinates
pyplot.show()

Results:

WARNING:gensim.models.base_any2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  app.launch_new_instance()


Google Word2Vec
Instead of training your own word vectors (which requires a lot of RAM and compute
power), you can simply use a pre-trained word embedding. Google has published a pre-
trained Word2Vec model that was trained on Google news data (about 100 billion
words). It contains 3 million words and phrases and was fit using 300-dimensional
word vectors. It is a 1.53 Gigabyte file.

import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# We now have the GoogleNews-vectors-negative300.bin binary file,
# which contains the weights of the model

[==================================================] 100.0%
1662.8/1662.8MB downloaded

# Alternative: load the Google word2vec model directly from the downloaded binary file
# from gensim.models import KeyedVectors
# filename = 'GoogleNews-vectors-negative300.bin'
# model = KeyedVectors.load_word2vec_format(filename, binary=True)


Let's have fun


# get word vector
model['car']

array([ 0.13085938, 0.00842285, 0.03344727, -0.05883789, 0.04003906, -


0.14257812, 0.04931641, -0.16894531, 0.20898438, 0.11962891, 0.18066406, -
0.25 , -0.10400391, -0.10742188, -0.01879883, 0.05200195, -0.00216675,
0.06445312, 0.14453125, -0.04541016, 0.16113281, -0.01611328, -0.03088379,
0.08447266, 0.16210938, 0.04467773, -0.15527344, 0.25390625, 0.33984375,
0.00756836, -0.25585938, -0.01733398, -0.03295898, 0.16308594, -
0.12597656, -0.09912109, 0.16503906, 0.06884766, -0.18945312, 0.02832031,
-0.0534668 , -0.03063965, 0.11083984, 0.24121094, -0.234375 , 0.12353516,
-0.00294495, 0.1484375 , 0.33203125, 0.05249023, -0.20019531, 0.37695312,
0.12255859, 0.11425781, -0.17675781, 0.10009766, 0.0030365 , 0.26757812,
0.20117188, 0.03710938, 0.11083984, -0.09814453, -0.3125 , 0.03515625,
0.02832031, 0.26171875, -0.08642578, -0.02258301, -0.05834961, -
0.00787354, 0.11767578, -0.04296875, -0.17285156, 0.04394531, -0.23046875,
0.1640625 , -0.11474609, -0.06030273, 0.01196289, -0.24707031, 0.32617188,
-0.04492188, -0.11425781, 0.22851562, -0.01647949, -0.15039062, -
0.13183594, 0.12597656, -0.17480469, 0.02209473, -0.1015625 , 0.00817871,
0.10791016, -0.24609375, -0.109375 , -0.09375 , -0.01623535, -0.20214844,
0.23144531, -0.05444336, -0.05541992, -0.20898438, 0.26757812, 0.27929688,
0.17089844, -0.17578125, -0.02770996, -0.20410156, 0.02392578, 0.03125 , -
0.25390625, -0.125 , -0.05493164, -0.17382812, 0.28515625, -0.23242188,
0.0234375 , -0.20117188, -0.13476562, 0.26367188, 0.00769043, 0.20507812,
-0.01708984, -0.12988281, 0.04711914, 0.22070312, 0.02099609, -0.29101562,
-0.02893066, 0.17285156, 0.04272461, -0.19824219, -0.04003906, -
0.16992188, 0.10058594, -0.09326172, 0.15820312, -0.16503906, -0.06054688,
0.19433594, -0.07080078, -0.06884766, -0.09619141, -0.07226562,
0.04882812, 0.07324219, 0.11035156, 0.04858398, -0.17675781, -0.33789062,
0.22558594, 0.16308594, 0.05102539, -0.08251953, 0.07958984, 0.08740234, -
0.16894531, -0.02160645, -0.19238281, 0.03857422, -0.05102539, 0.21972656,
0.08007812, -0.21191406, -0.07519531, -0.15039062, 0.3046875 , -
0.17089844, 0.12353516, -0.234375 , -0.10742188, -0.06787109, 0.01904297,
-0.14160156, -0.22753906, -0.16308594, 0.14453125, -0.15136719, -0.296875
, 0.22363281, -0.10205078, -0.0456543 , -0.21679688, -0.09033203, 0.09375
, -0.15332031, -0.01550293, 0.3046875 , -0.23730469, 0.08935547,
0.03710938, 0.02941895, -0.28515625, 0.15820312, -0.00306702, 0.06054688,
0.00497437, -0.15234375, -0.00836182, 0.02197266, -0.12109375, -
0.13867188, -0.2734375 , -0.06835938, 0.08251953, -0.26367188, -
0.16992188, 0.14746094, 0.08496094, 0.02075195, 0.13671875, -0.04931641, -
0.0100708 , -0.00369263, -0.10839844, 0.14746094, -0.15527344, 0.16113281,
0.05615234, -0.05004883, -0.1640625 , -0.26953125, 0.4140625 , 0.06079102,
-0.046875 , -0.02514648, 0.10595703, 0.1328125 , -0.16699219, -0.04907227,
0.04663086, 0.05151367, -0.07958984, -0.16503906, -0.29882812, 0.06054688,
-0.15332031, -0.00598145, 0.06640625, -0.04516602, 0.24316406, -
0.07080078, -0.36914062, -0.23144531, -0.11914062, -0.08300781,
0.14746094, -0.05761719, 0.23535156, -0.12304688, 0.14648438, 0.13671875,
0.15429688, 0.02111816, -0.09570312, 0.05859375, 0.03979492, -0.08105469,
0.0559082 , -0.16601562, 0.27148438, -0.20117188, -0.00915527, 0.07324219,


0.10449219, 0.34570312, -0.26367188, 0.02099609, -0.40039062, -0.03417969,


-0.15917969, -0.08789062, 0.08203125, 0.23339844, 0.0213623 , -0.11328125,
0.05249023, -0.10449219, -0.02380371, -0.08349609, -0.04003906,
0.01916504, -0.01226807, -0.18261719, -0.06787109, -0.08496094, -
0.03039551, -0.05395508, 0.04248047, 0.12792969, -0.27539062, 0.28515625,
-0.04736328, 0.06494141, -0.11230469, -0.02575684, -0.04125977,
0.22851562, -0.14941406, -0.15039062], dtype=float32)
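Each word is now a 300-dimensional vector, and similarity between two words is the cosine of the angle between their vectors. Here is a small sketch (not from the original notes; the word 'truck' is an illustrative choice) that computes this by hand and then with Gensim's built-in helper.

import numpy as np

car = model['car']
truck = model['truck']

# cosine similarity computed by hand
print(np.dot(car, truck) / (np.linalg.norm(car) * np.linalg.norm(truck)))

# the same value via Gensim's helper
print(model.similarity('car', 'truck'))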

# get most similar words


model.most_similar('yellow')

[('red', 0.751919150352478), ('bright_yellow', 0.6869138479232788),


('orange', 0.6421886682510376), ('blue', 0.6376121044158936), ('purple',
0.6272757053375244), ('yellows', 0.612633228302002), ('pink',
0.6098285913467407), ('bright_orange', 0.5974606871604919),
('Warplanes_streaked_overhead', 0.583052396774292), ('participant_LOGIN',
0.5816755294799805)]

# queen = (king - man) + woman


result = model.most_similar(positive=['woman', 'king'], negative=['man'],
topn=1)
print(result)

[('queen', 0.7118192911148071)]
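Under the hood, most_similar with positive and negative word lists performs exactly this vector arithmetic and then ranks every vocabulary word by cosine similarity to the resulting vector. A rough sketch (not from the original notes) of doing it by hand with Gensim's similar_by_vector helper:

# king - man + woman, computed explicitly on the raw vectors
vec = model['king'] - model['man'] + model['woman']

# rank vocabulary words by cosine similarity to the composed vector;
# note that the input words themselves often dominate the top results
print(model.similar_by_vector(vec, topn=5))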

# (paris - france) + spain = ?

result = model.most_similar(positive=["paris", "spain"], negative=["france"], topn=1)
print(result)

[('madrid', 0.5295541286468506)]

model.doesnt_match(["red", "blue", "car", "orange"])

# we give the model a list of words and it returns the one that doesn't match (the odd one out)
# here 'car' has the lowest average similarity to the other words, so it is returned

/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py:895:
FutureWarning: arrays to stack must be passed as a "sequence" type such as
list or tuple. Support for non-sequence iterables such as generators is
deprecated as of NumPy 1.16 and will raise an error in the future.


  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)

'car'
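doesnt_match works roughly by averaging the (length-normalized) vectors of the given words and returning the word least similar to that average. A hedged sketch of the idea (not the library's exact implementation):

import numpy as np

words = ["red", "blue", "car", "orange"]
vecs = np.array([model[w] / np.linalg.norm(model[w]) for w in words])  # unit vectors
mean = vecs.mean(axis=0)

# cosine similarity of each word to the mean; the lowest one is the odd word out
sims = vecs @ (mean / np.linalg.norm(mean))
print(words[int(np.argmin(sims))])  # expected: 'car'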

Stanford’s GloVe Embedding


Like Word2Vec, the GloVe researchers also provide pre-trained word vectors. Let's
download the smallest GloVe pre-trained model from the GloVe website. It is an 822-
megabyte zip file with 4 different models (50, 100, 200 and 300-dimensional vectors)
trained on Wikipedia data with 6 billion tokens and a 400,000-word vocabulary.

# download
!wget http://nlp.stanford.edu/data/glove.6B.zip

# unzip downloaded word embeddings


!unzip glove.6B.zip

# list files in current directory


!ls -lah

--2022-09-16 13:42:06-- http://nlp.stanford.edu/data/glove.6B.zip


Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80...
connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-09-16 13:42:06-- https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443...
connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
[following]
--2022-09-16 13:42:06--
https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)...
171.64.64.22
Connecting to downloads.cs.stanford.edu
(downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6B.zip 100%[===================>] 822.24M 5.01MB/s in 2m 39s


2022-09-16 13:44:46 (5.16 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive: glove.6B.zip
inflating: glove.6B.50d.txt
inflating: glove.6B.100d.txt
inflating: glove.6B.200d.txt
inflating: glove.6B.300d.txt
total 2.9G
drwxr-xr-x 1 root root 4.0K Sep 16 13:45 .
drwxr-xr-x 1 root root 4.0K Sep 16 12:50 ..
drwxr-xr-x 4 root root 4.0K Sep 14 13:43 .config
-rw-rw-r-- 1 root root 332M Aug 4 2014 glove.6B.100d.txt
-rw-rw-r-- 1 root root 662M Aug 4 2014 glove.6B.200d.txt
-rw-rw-r-- 1 root root 990M Aug 27 2014 glove.6B.300d.txt
-rw-rw-r-- 1 root root 164M Aug 4 2014 glove.6B.50d.txt
-rw-r--r-- 1 root root 823M Oct 25 2015 glove.6B.zip
-rw-r--r-- 1 root root 22K Sep 16 13:03 model.bin

drwxr-xr-x 1 root root 4.0K Sep 14 13:44 sample_data

from gensim.models import KeyedVectors


from gensim.scripts.glove2word2vec import glove2word2vec

# convert the 100-dimensional version of the GloVe model to word2vec format
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

# load the converted model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# binary=False because the file is not binary; it is just a converted text file
# Here the model named 'model' is loaded into memory
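As a hedged aside not in the original notes: in Gensim 4.x the glove2word2vec script is deprecated, and the GloVe text file can be loaded directly, roughly like this:

from gensim.models import KeyedVectors

# Gensim 4.x: load the raw GloVe file without converting it first
# (no_header=True tells the loader the file has no word2vec header line)
model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)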

# calculate: (king - man) + woman = ?


result = model.most_similar(positive=['woman', 'king'], negative=['man'],
topn=1)
print(result)

[('queen', 0.7698541283607483)]
