
Habib Mrad – Introduction to NLP

NLP Part from week 7

NLP = Natural Language Processing: dealing with textual data

Field of study focused on making sense of language

Uses computers to work with natural language data

Automated way to process data & organize information

Challenges of Natural Language

1. Lexical Ambiguity – Words have multiple meanings (e.g., bass: a fish vs. bass: a musical instrument)

2. Syntactic Ambiguity – A sentence has multiple possible parse trees, resulting in different
meanings

The professor said on Monday he would give an exam.

The government asks us to save soap and waste paper.

3. Semantic Ambiguity – A sentence has multiple possible meanings

Call me a cab (order me a taxi).

You are a cab (the other reading: label me as "a cab").

4. Anaphoric Ambiguity – A word or phrase (such as a pronoun) refers back to something
previously mentioned, but it is unclear which earlier mention it refers to.

Parsing is the process of breaking down a sentence into its elements so that the sentence can be
understood. Traditional parsing is done by hand, sometimes using sentence diagrams. Parsing is also
involved in more complex forms of analysis such as discourse analysis and psycholinguistics.


Example differentiating semantic vs syntactic ambiguity:

Semantic ambiguity: "I saw her duck" has 2 meanings, depending on the meaning of the word "duck".

If "duck" is the verb, the sentence means the girl lowered her head.

If "duck" refers to the animal, it means I saw her pet duck.

So the meaning of the whole sentence depends on the meaning of the word "duck".

Syntactic ambiguity: The chicken is ready to eat

Applications of NLP
 Part-Of-Speech (POS) tagging: a way to help the computer understand text by assigning a tag to each
word in the sentence: verb, noun, adjective, etc.

Break down the sentence into words and assign a tag to each word

POS tagging can help resolve syntactic ambiguity.
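As a quick illustration, a minimal sketch of POS tagging with NLTK's pos_tag (assuming the 'punkt' and 'averaged_perceptron_tagger' resources can be downloaded; the tags shown in the comment are indicative):

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # POS tagger model

sentence = "The professor said on Monday he would give an exam."
print(pos_tag(word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('professor', 'NN'), ('said', 'VBD'), ('on', 'IN'), ('Monday', 'NNP'), ...]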


 Another application of NLP is Named Entity Recognition (NER): extracting specific predefined
entities from text, such as names, dates, places, organizations, etc.

We can train a custom NER model to extract custom entities
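A minimal NER sketch using NLTK's built-in chunker (an assumption on my part, since the notes do not show NER code; the 'maxent_ne_chunker' and 'words' resources must be downloaded):

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # pre-trained named entity chunker
nltk.download('words')              # word list used by the chunker

sentence = "Albert Einstein was a professor at the University of Berlin."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))  # labels spans such as PERSON, ORGANIZATION, GPE
print(tree)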

Sentiment analysis: extracting or assigning emotions or sentiments to a sentence. Applied to tweets and
customer reviews for movies or other products (happy or angry about the product).

Text translation


Visual Question Answering: combines CV and NLP. Train an AI model to look at an image and answer a
question that we ask. The model has to understand the content of the image and then understand the
question before it can answer.

Image captioning: combines CV and NLP. You show the model an image, and the model has to describe
the image with words, i.e. generate a sentence that describes the image.


Automatic Handwriting Generation: train a model to generate handwritten text that looks realistic


Data Preparation:
1. Text cleaning
 Remove punctuation and special characters (@!.,><-+#$%^), as these don't hold any extra
information
 Remove numbers and emojis, or replace them with something more meaningful
 Remove leading and trailing whitespace: it takes space and doesn't hold any meaningful
information
 Remove stopwords (of, at, by, for, with, etc.): they don't hold information and are frequently
repeated
 Normalize case (Apple vs apple): transform everything to lowercase
 Remove HTML/XML tags: needed if the data was collected by web scraping
 Replace accented characters (such as é)
 Correct spelling errors: multiple similar spellings of the same word can confuse the model
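A few of these steps (HTML/XML tags, accented characters, whitespace) are not covered by the Colab code later in these notes; a minimal sketch using only the standard library (the sample string is made up for illustration):

import re
import unicodedata

raw = "  <p>Café   was GREAT!!</p>  "

text = re.sub(r'<[^>]+>', ' ', raw)                    # remove HTML/XML tags
text = unicodedata.normalize('NFKD', text)             # split accented characters into base letter + accent
text = text.encode('ascii', 'ignore').decode('ascii')  # drop the accent marks (é -> e)
text = text.lower()                                    # normalize case
text = re.sub(r'\s+', ' ', text).strip()               # collapse repeated whitespace and trim leading/trailing spaces
print(text)  # "cafe was great!!"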

2. Tokenization
 Turning a string or document into tokens (smaller chunks of text).
 Important Step in preparing text for NLP
 Different approaches:
 Split by Whitespace
 Split by word
 Custom regex (regular expression)

3. Stemming & Lemmatization


 Shorten words to their root stems. The main point is to reduce complexity and the
number of unique words that the model has to learn, while maintaining the overall meaning or
information.
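Lemmatization is not demonstrated in the Colab below (which sticks to stemming); a minimal sketch with NLTK's WordNetLemmatizer, assuming the 'wordnet' resource is downloaded:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # dictionary the lemmatizer relies on

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('theories'))         # 'theory'  (treated as a noun by default)
print(lemmatizer.lemmatize('turned', pos='v'))  # 'turn'    (the part of speech must be given for verbs)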


Natural Language Toolkit (NLTK):


 A Python library written for working with and modeling natural language (NL) text.
 It provides good tools and functions for loading and cleaning text
 Usually used to get data ready for Machine Learning and Deep Learning algorithms.


Text Cleaning
This Colab will show you how to clean your data in preparation for NLP. In the
first section, we will use built-in Python functions; in the second, we will
introduce the NLTK library.

Split by white space


Splitting a document or text into words. Calling split() with no input parameters splits
the text on whitespace only. "Who's", for example, is treated as a single word.

text = 'Albert Einstein is widely celebrated as one of the most brilliant scientists '\
'who’s ever lived. His fame was due to his original and creative theories that at first '\
'seemed crazy, but that later turned out to represent the actual physical world. '\
'Nonetheless, when he applied his theory of general relativity to the universe as a whole '\
'in a paper published in 1917, while serving as Director of the Kaiser Wilhelm Institute '\
'for Physics and professor at the University of Berlin, Einstein suggested the notion of '\
'a "cosmological constant". He discarded this notion when it had been established that '\
'the universe was indeed expanding. His contributions to physics made it possible to '\
'envision how the universe evolved.'\
'In order to understand Einstein’s contribution to cosmology it is helpful to begin with '\
'his theory of gravity. Rather than thinking of gravity as an attractive force between '\
'two objects, in the tradition of Isaac Newton, Einstein’s conception was that gravity is '\
'a property of massive objects that “bends” space and time around itself. For example, '\
'consider the question of why the Moon does not fly off into space, rather than staying '\
'in orbit around Earth. Newton would say that gravity is a force acting between the Earth '\
'and Moon, holding it in orbit. Einstein would say that the massive Earth “bends” space '\
'and time around itself, so that the moon follows the curves created by the massive '\
'Earth. His theory was confirmed when he predicted that even starlight would bend when '\
'passing near the sun during a solar eclipse.'\
'In 1917 Einstein published a paper in which he applied this theory to all matter in '\
'space. His theory led to the conclusion that all the mass in the universe would bend '\
'space so much that it should have long ago contracted into a single dense blob. Given '\
'that the universe seems pretty well spread out, however, and does not seem to be '\
'contracting, Einstein decided to add a “fudge factor,” that acts like “anti-gravity” and '\
'prevents the universe from collapsing. He called this idea, which was represented as an '\
'additional term in the mathematical equation representing his theory of gravity, the '\
'cosmological constant. In other words, Einstein supposed the universe to be static and '\
'unchanging, because that is the way it looked to astronomers in 1917.'

# split into words by white space
words = text.split()  # automatically split the text on whitespace
# words is an array that contains all the different words separated by whitespace

print(words[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who’s', 'ever', 'lived.',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy,', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world.', 'Nonetheless,', 'when', 'he', 'applied', 'his',
'theory', 'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', '1917,', 'while',
'serving', 'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University',
'of', 'Berlin,', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', '"cosmological', 'constant".', 'He', 'discarded', 'this', 'notion',
'when', 'it', 'had', 'been', 'established', 'that', 'the', 'universe']

Split by word
Using regular expression re and splitting based on words. Notice the difference in
"who's".

import re  # Python module for regular expressions

Regular expressions allow us to define a certain pattern and then extract or replace text that fits this pattern.

words = text.split()  # automatically split text by whitespace

# split based on words only
words = re.split(r'\W+', text)  # \W+ matches runs of non-word characters, so the text is split into words
# text = the text we want to split

print(words[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', 'Nonetheless', 'when', 'he', 'applied', 'his', 'theory',
'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', '1917', 'while', 'serving',
'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University', 'of',
'Berlin', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', 'cosmological', 'constant', 'He', 'discarded', 'this', 'notion', 'when',
'it', 'had', 'been', 'established', 'that', 'the']


Normalizing case
Normalizing is when we turn all the words of the document into lower case. Be careful,
however: this method should not always be applied, because it might change the meaning.
For example, take the French telecom company Orange and the fruit orange. Normalizing
would conflate the two and change the meaning.

# Normalizing means converting all the characters to lowercase

# split into words by white space
words = text.split()  # separate the whole text into chunks of words

# convert to lower case
words = [word.lower() for word in words]  # loop over the words using a list comprehension
print(words[:100])

NLTK
The Natural Language Toolkit is a suite of libraries and programs for symbolic
and statistical natural language processing for English, written in the Python
programming language. (Wikipedia)

Split into sentences


This tokenizer divides a text into a list of sentences by using an unsupervised
algorithm to build a model for abbreviation words, collocations, and words that start
sentences. It must be trained on a large collection of plaintext in the target language
before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English. (nltk.org)

import nltk
from nltk import sent_tokenize  # function responsible for tokenizing (splitting) our text into sentences
nltk.download('punkt')  # the Punkt tokenizer is an algorithm responsible for tokenizing the text into sentences

# split into sentences
sentences = sent_tokenize(text)


for sentence in sentences:
    print(sentence)

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.
Albert Einstein is widely celebrated as one of the most brilliant scientists who’s ever lived.
His fame was due to his original and creative theories that at first seemed crazy, but that later turned out to represent
the actual physical world.
Nonetheless, when he applied his theory of general relativity to the universe as a whole in a paper published in 1917,
while serving as Director of the Kaiser Wilhelm Institute for Physics and professor at the University of Berlin,
Einstein suggested the notion of a "cosmological constant".
He discarded this notion when it had been established that the universe was indeed expanding.
His contributions to physics made it possible to envision how the universe evolved.In order to understand Einstein’s
contribution to cosmology it is helpful to begin with his theory of gravity.
Rather than thinking of gravity as an attractive force between two objects, in the tradition of Isaac Newton,
Einstein’s conception was that gravity is a property of massive objects that “bends” space and time around itself.
For example, consider the question of why the Moon does not fly off into space, rather than staying in orbit around
Earth.
Newton would say that gravity is a force acting between the Earth and Moon, holding it in orbit.
Einstein would say that the massive Earth “bends” space and time around itself, so that the moon follows the curves
created by the massive Earth.
His theory was confirmed when he predicted that even starlight would bend when passing near the sun during a solar
eclipse.In 1917 Einstein published a paper in which he applied this theory to all matter in space.
His theory led to the conclusion that all the mass in the universe would bend space so much that it should have long
ago contracted into a single dense blob.
Given that the universe seems pretty well spread out, however, and does not seem to be contracting, Einstein
decided to add a “fudge factor,” that acts like “anti-gravity” and prevents the universe from collapsing.
He called this idea, which was represented as an additional term in the mathematical equation representing his
theory of gravity, the cosmological constant.
In other words, Einstein supposed the universe to be static and unchanging, because that is the way it looked to
astronomers in 1917.

Split into words


From the same toolkit, we consider the tokenize library and import the word tokenizer.
Similarly to re.split, this function will split the text into tokens rather than words.
Make sure you check out the output and spot the differences!

from nltk.tokenize import word_tokenize


# split into words
tokens = word_tokenize(text)
print(tokens[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', '’', 's', 'ever',
'lived', '.', 'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', ',',
'but', 'that', 'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', '.', 'Nonetheless', ',', 'when', 'he',
'applied', 'his', 'theory', 'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in',
'1917', ',', 'while', 'serving', 'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at',
'the', 'University', 'of', 'Berlin', ',', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', '``', 'cosmological', 'constant', "''", '.',
'He']


Filter out punctuation


Python strings include the built-in method isalpha(), which can be used to determine
whether the scanned word is alphabetic or something else (numeric, punctuation, special
characters, etc.).

# split into words
tokens = word_tokenize(text)

# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
# isalpha() returns True if all the characters in a word are alphabetic, otherwise it returns False
print(words[:100])

['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', 'Nonetheless', 'when', 'he', 'applied', 'his', 'theory',
'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', 'while', 'serving', 'as',
'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University', 'of', 'Berlin',
'Einstein', 'suggested', 'the', 'notion', 'of', 'a', 'cosmological', 'constant', 'He', 'discarded', 'this', 'notion', 'when', 'it', 'had',
'been', 'established', 'that', 'the', 'universe']

Remove stopwords
Stopwords are words which do not add much meaning to a sentence. They can
safely be ignored without sacrificing the meaning of the sentence. The most common
are short function words such as the, is, at, which, on, etc.
In some cases, removing stopwords can cause problems when searching for phrases that
include them, particularly in names such as “The Who” or “Take That”.
Treating the word "not" as a stopword also changes the entire meaning if it is removed (try
"this code is not good").

# let's list all the stopwords for NLTK

import nltk
from nltk.corpus import stopwords  # importing the list of all the stopwords for NLTK

nltk.download('stopwords')  # download the stopwords for different languages


stop_words = stopwords.words('english')  # extract only the stopwords for the English language
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was',
'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan',
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.

As you can see, the stopwords are all lower case and don't have punctuation. If
we're going to compare them with our tokens, we need to make sure that our text is prepared
the same way. (Our text must have the same format as the stopwords, so we have to clean
the text before we can compare them.)

This cell recaps everything we have previously learned in this Colab: tokenizing, lower
casing and checking for alphabetic words.

# clean our text

# split into words


tokens = word_tokenize(text)

# convert to lower case


tokens = [w.lower() for w in tokens]

# remove all tokens that are not alphabetic


words = [word for word in tokens if word.isalpha()]

# filter out stop words


stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[:100])

['albert', 'einstein', 'widely', 'celebrated', 'one', 'brilliant', 'scientists', 'ever', 'lived', 'fame', 'due', 'original', 'creative',
'theories', 'first', 'seemed', 'crazy', 'later', 'turned', 'represent', 'actual', 'physical', 'world', 'nonetheless', 'applied',
'theory', 'general', 'relativity', 'universe', 'whole', 'paper', 'published', 'serving', 'director', 'kaiser', 'wilhelm', 'institute',


'physics', 'professor', 'university', 'berlin', 'einstein', 'suggested', 'notion', 'cosmological', 'constant', 'discarded',
'notion', 'established', 'universe', 'indeed', 'expanding', 'contributions', 'physics', 'made', 'possible', 'envision',
'universe', 'order', 'understand', 'einstein', 'contribution', 'cosmology', 'helpful', 'begin', 'theory', 'gravity', 'rather',
'thinking', 'gravity', 'attractive', 'force', 'two', 'objects', 'tradition', 'isaac', 'newton', 'einstein', 'conception', 'gravity',
'property', 'massive', 'objects', 'bends', 'space', 'time', 'around', 'example', 'consider', 'question', 'moon', 'fly', 'space',
'rather', 'staying', 'orbit', 'around', 'earth', 'newton', 'would']

Stem words
Stemming refers to the process of reducing each word to its root or base form.
Two common suffix-stripping stemmers are Porter and Lancaster; each has its own
algorithm and they sometimes produce different outputs.

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer  # algorithm responsible for stemming the words

# split into words
tokens = word_tokenize(text)

# stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

['albert', 'einstein', 'is', 'wide', 'celebr', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientist', 'who', '’', 's', 'ever', 'live', '.', 'hi',
'fame', 'wa', 'due', 'to', 'hi', 'origin', 'and', 'creativ', 'theori', 'that', 'at', 'first', 'seem', 'crazi', ',', 'but', 'that', 'later', 'turn',
'out', 'to', 'repres', 'the', 'actual', 'physic', 'world', '.', 'nonetheless', ',', 'when', 'he', 'appli', 'hi', 'theori', 'of', 'gener', 'rel',
'to', 'the', 'univers', 'as', 'a', 'whole', 'in', 'a', 'paper', 'publish', 'in', '1917', ',', 'while', 'serv', 'as', 'director', 'of', 'the',
'kaiser', 'wilhelm', 'institut', 'for', 'physic', 'and', 'professor', 'at', 'the', 'univers', 'of', 'berlin', ',', 'einstein', 'suggest', 'the',
'notion', 'of', 'a', '``', 'cosmolog', 'constant', "''", '.', 'he']

from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer  # another stemmer algorithm we can use (similar functionality to the previous one)

# split into words
tokens = word_tokenize(text)

# stemming of words
lancaster = LancasterStemmer()
stemmed = [lancaster.stem(word) for word in tokens]
print(stemmed[:100])


['albert', 'einstein', 'is', 'wid', 'celebr', 'as', 'on', 'of', 'the', 'most', 'bril', 'sci', 'who', '’', 's', 'ev', 'liv', '.', 'his', 'fam', 'was',
'due', 'to', 'his', 'origin', 'and', 'cre', 'the', 'that', 'at', 'first', 'seem', 'crazy', ',', 'but', 'that', 'lat', 'turn', 'out', 'to', 'repres',
'the', 'act', 'phys', 'world', '.', 'nonetheless', ',', 'when', 'he', 'apply', 'his', 'the', 'of', 'gen', 'rel', 'to', 'the', 'univers', 'as', 'a',
'whol', 'in', 'a', 'pap', 'publ', 'in', '1917', ',', 'whil', 'serv', 'as', 'direct', 'of', 'the', 'kais', 'wilhelm', 'institut', 'for', 'phys',
'and', 'profess', 'at', 'the', 'univers', 'of', 'berlin', ',', 'einstein', 'suggest', 'the', 'not', 'of', 'a', '``', 'cosmolog', 'const', "''", '.',
'he']

Feature Extraction using Bag of Words

ML only deals with numbers, so we need to convert text into numbers.


Bag of Words is an approach that does exactly that

Bags of words model:


 Need to convert text to numbers
 Generate a fixed-length vector of numbers that represents the input sentence
 Collect the unique words and assign a unique number to each word
 We are only concerned with which words are present, not the order in which they
are present
 This method is called BoW because any information about the order or structure of the
words in the document is discarded. The model is only concerned with whether known
words occur in the documents, not where in the document they occur


For example:
We have 3 sentences in our text corpus.

 The first step in BoW is to collect each unique word and give it a certain ID or number:
this is called the word_index
 Next, we create a vector of size 5 because our word_index is made of 5 words, and
place a 1 wherever the word exists in the sentence and a 0 if it doesn't exist

Instead of placing a 1 whenever the word exists, there are other methods for scoring words. For
example, we can use the count (how many times the word occurred) or the frequency (how
frequently this word appears in the text corpus). A minimal sketch of the basic idea is shown below.
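A minimal sketch of the word_index and binary-vector idea in plain Python (the three sentences are made up for illustration; the figure from the original notes is not reproduced here):

# toy corpus of 3 sentences
corpus = ["the food was good",
          "the food was bad",
          "good food"]

# step 1: collect each unique word and give it an ID - the word_index
word_index = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index)
print(word_index)  # {'the': 0, 'food': 1, 'was': 2, 'good': 3, 'bad': 4}

# step 2: one fixed-length binary vector per sentence (1 if the word is present, 0 otherwise)
for sentence in corpus:
    vector = [0] * len(word_index)
    for word in sentence.split():
        vector[word_index[word]] = 1
    print(sentence, '->', vector)
# "the food was good" and "the food was bad" differ only in the last two positions;
# replacing the 1s with counts or frequencies gives the alternative scoring schemes mentioned above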


Word Counts:
A simple way to tokenize text and build vocabulary of known words (sometimes called the
word_index).

 Create an instance of the CountVectorizer class.


 Call the fit() function to learn a vocabulary from documents.
 Call the transform() function on documents to encode each as a numerical vector.

Word Frequencies with TF-IDF


Generate scores to highlight words that are more interesting: frequent in a document but not
across all documents.

 Term Frequency (TF): how often a given word appears within a document.
 Inverse Document Frequency (IDF): how rare the word is across documents.

Word Frequencies with TF-IDF Example
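As a worked illustration, a sketch that mirrors scikit-learn's default smoothed IDF (the same formula the TfidfVectorizer code below relies on), using the same toy three-document corpus, lower-cased and tokenized:

import math

docs = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
        ["the", "dog"],
        ["the", "fox"]]
n = len(docs)  # number of documents

def idf(word):
    df = sum(1 for doc in docs if word in doc)  # number of documents containing the word
    return math.log((1 + n) / (1 + df)) + 1     # smoothed IDF, as scikit-learn computes it by default

def tf(word, doc):
    return doc.count(word)                      # raw term frequency inside one document

print(idf("the"))    # 1.0   -> appears in every document, so it carries little information
print(idf("brown"))  # ~1.69 -> appears in only one document, so it gets a higher weight
print(tf("the", docs[0]) * idf("the"))      # un-normalized TF-IDF of "the" in the first document
print(tf("brown", docs[0]) * idf("brown"))  # un-normalized TF-IDF of "brown" in the first document

scikit-learn additionally L2-normalizes each document's vector, which is why the values printed by TfidfVectorizer later in these notes fall between 0 and 1.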


Using scikit-learn for feature extraction. It tokenizes the text and builds a vocabulary of words
using the TF-IDF approach.

 Create an instance of the TfidfVectorizer class


 Call the fit() function to learn a vocabulary from documents (text corpus).
 Call the transform() function on documents to encode each as a numerical vector.


BoW is a useful feature extraction method, but has its limitations

Limitations of BoW
 Vocabulary: it requires generating a large vocabulary of all the words in the corpus.
Applying different text cleaning steps will result in different words and therefore a
different vocabulary. The vocab requires careful design, specifically to manage the size,
which impacts the sparsity of the document representations.

 Sparsity: having a large vocabulary of words will result in a large, sparse
vector representation that is mostly filled with zeros and a few ones. Sparse
representations are harder to model, both for computational reasons and for
information reasons: the challenge is for the models to harness so little
information in such a large representational space. (It is harder to train the model, so it is
also harder to achieve good results.)

 No order in the words:


The food was good, not bad at all
The food was bad, not good at all

The two sentences are opposite in meaning, but their BoW representations result
in the same vector, since BoW doesn't care about the order of the words.

 Meaning: Discarding word order ignores the context of the sentence and in turn loses
the meaning of the words in the document (semantics). Context and meaning can offer a lot
to the model; if this information were modeled, it could tell the difference between the
same words arranged differently.

Word embeddings are an ALTERNATIVE feature extraction method.


Bag Of Words (BOW)

This notebook shows the most basic applications of the different functions used
to create bag of words for texts

Word Counts
In this cell, we can see how applying the fit() and transform() functions enables
us to create a vocabulary of 8 words for the document. The vectorizer's fit builds
the vocab, and transform encodes the document as the number of appearances of
each word. The indexing is done alphabetically.

from sklearn.feature_extraction.text import CountVectorizer  # to look at the word counts using CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)  # the vectorizer goes over the text and builds the word vocab, which means it looks at all the unique words
# and then gives each word a certain ID or index number

# summarize
print(vectorizer.vocabulary_)  # print the vocabulary of this vectorizer

# encode document
vector = vectorizer.transform(text)  # transform the text: convert the words into numbers based on how they appear in the sentence.
# we get the vector, which is the encoded document of the text

# summarize encoded vector
print(vector.shape)      # print the shape of the vector
print(vector.toarray())  # print the values of the vector

# {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}: this is the Vocabulary of the Corpus,
# i.e. the list of unique words in our corpus, where each word has a certain ID.
# These IDs were given based on the alphabetical order of the words

# (1, 8): size of the vector, with 1 row of size 8 (because we have 8 unique words in our vocabulary)

# [[1 1 1 1 1 1 1 2]]: values of the vector. Take the first item '1' at index 0, in this case 'brown'.
# How many times do we have 'brown' in our text? We have the word 'brown' 1 time, so we put 1

# This is how we convert our text into numbers: by counting the number of times each word appears in the text

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
[[1 1 1 1 1 1 1 2]]

Let's test on another example


In this example, we used the same vocabulary we built in the previous cell held in
vectorizer but on a different text. Obviously the resulting count is different since the
word 'brown' for example does not appear in this document. It is given the value 0,
whereas the word 'quick' appears 3 times.

# in this example we are NOT going to fit or learn a new vocab; we are just going to transform this text using the already fitted vectorizer

text = ["The quick quick quick fox jumped over a big dog"]

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)      # (1, 8): still 1 row (one vector) of size 8 because the vocab has not changed; we are still using a vocab of size 8
print(vector.toarray())  # [[0 1 1 1 0 1 3 1]]: the values of the vector have now changed

# this is how we convert text into numbers using the CountVectorizer class; any word that doesn't exist in our vocab is simply not counted

(1, 8)
[[0 1 1 1 0 1 3 1]]


Word frequencies
This section deals with the application of the TF-IDF formula in order to describe word
appearances relative to multiple documents

Source: https://medium.com/nlpgurukool/tfidf-vectorizer-5421f1528402
This is another approach to vectorize or convert our text into numbers

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents: this doesn't work on only ONE document; we need at least 2 or 3 documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)  # fit is used here to build the vocab

# summarize
print(vectorizer.vocabulary_)  # print the vocab
print(vectorizer.idf_)         # print the IDF value of each word

# encode document
vector = vectorizer.transform([text[0]])  # transform the original text into its TF-IDF representation using the transform function

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

# {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}: the vocab

# [1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.]: the number associated with every word,
# calculated using the TF-IDF formula.
# 'brown': 0 is the first word in our index and has IDF = 1.69314718, 'dog': 1 has IDF = 1.28768207, and so on and so forth

# After transforming our first document ("The quick brown fox jumped over the lazy dog.") with vectorizer.transform([text[0]]),
# we get a vector of size 8:
# [[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646 0.36388646 0.42983441]]
# and these are the values of the vector after transformation

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646 0.36388646 0.42983441]]

You can tell how, for example, the word 'brown' has a higher multiplier than 'the', even
though the latter appears twice in the first document. This is because 'the' also
appears in the other two documents, which reduces its uniqueness.

Let's test it on another example

text = ["The quick quick quick fox jumped over a big dog"]

# encode document
vector = vectorizer.transform(text)  # vectorize this text using the fitted vectorizer

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

# The first item, 'brown', has a value of 0 because this text does not contain the word 'brown'
# The word 'quick', which has index 6, has the highest value here (0.84833724) because it is present
# more than once, so its term frequency is very high

We find ourselves with two null values because this document does not include the
words 'brown' or 'lazy'. And since the word 'big' is not included in the vocab, it is discarded.


sentiment_analysis.ipynb

Text classification
Text classification is one of the important tasks of text mining

In this notebook, we will perform Sentiment Analysis on IMDB movie reviews.

Sentiment Analysis is the art of extracting people's opinions from digital text. We will use
a logistic regression model from Scikit-Learn to predict the sentiment given a movie
review.
We will use the IMDB movie review dataset, which consists of 50,000 movie reviews
(50% are positive, 50% are negative).

The libraries needed in this exercise are:

 Numpy — a package for scientific computing.


 Pandas — a library providing high-performance, easy-to-use data structures and
data analysis tools for Python
 Matplotlib — a package for plotting & visualizations.
 scikit-learn — a tool for data mining and data analysis.
 NLTK — a platform to work with natural language.


Loading the data

Importing the libraries and necessary dictionaries


import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
from tensorflow import keras

# download Punkt Sentence Tokenizer


nltk.download('punkt')
# download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True

Loading the dataset in our directory


# download IMDB dataset
!wget "https://raw.githubusercontent.com/javaidnabi31/Word-Embeddding-
Sentiment-Classification/master/movie_data.csv" -O "movie_data.csv"
# wget is a command to parse from URL and save the file on "movie_data.csv
"
# -
o is used to specify the output name of the file after downloading the dat
aset

# list files in current directory


!ls -lah

--2022-09-16 10:20:36--
https://raw.githubusercontent.com/javaidnabi31/Word-Embeddding-Sentiment-Classification/master/movie_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)...
185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com
(raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65862309 (63M) [text/plain]
Saving to: ‘movie_data.csv’


movie_data.csv 100%[===================>] 62.81M 316MB/s in


0.2s

2022-09-16 10:20:38 (316 MB/s) - ‘movie_data.csv’ saved


[65862309/65862309]

total 63M
drwxr-xr-x 1 root root 4.0K Sep 16 08:30 .
drwxr-xr-x 1 root root 4.0K Sep 16 08:27 ..
drwxr-xr-x 4 root root 4.0K Sep 14 13:43 .config
-rw-r--r-- 1 root root 63M Sep 16 10:20 movie_data.csv
drwxr-xr-x 1 root root 4.0K Sep 14 13:44 sample_data

Reading the dataset file and getting info on it


Using pandas to read the csv file and displaying the first 5 rows

# path to IMDB dataseet


data = "movie_data.csv" # path that points to the dataset

# read file (dataset) into our program using pandas


df = pd.read_csv(data)

# display first 5 rows


df.head()

Getting info on our dataset


df.info()
# Dtype = object which means a string


# Dtype = int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review 50000 non-null object
1 sentiment 50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.4+ KB

A balanced dataset in sentiment analysis is a dataset which holds an equal amount of positive
sentiment data and negative sentiment data, meaning 50% of the data is positive and 50% is
negative

# check if the dataset is balanced (number of positive sentiments = number of negative sentiments)
# by plotting the different classes

df.sentiment.value_counts()

1 25000
0 25000
Name: sentiment, dtype: int64

df.sentiment.value_counts().plot(kind='bar')


Text cleaning
print(df.review[10]) # print the review at index 10

I loved this movie from beginning to end.I am a musician and i let drugs get in the way of my some of the things i
used to love(skateboarding,drawing) but my friends were always there for me.Music was like my rehab,life
support,and my drug.It changed my life.I can totally relate to this movie and i wish there was more i could say.This
movie left me speechless to be honest.I just saw it on the Ifc channel.I usually hate having satellite but this was a
perk of having satellite.The ifc channel shows some really great movies and without it I never would have found this
movie.Im not a big fan of the international films because i find that a lot of the don't do a very good job on
translating lines.I mean the obvious language barrier leaves you to just believe thats what they are saying but its not
that big of a deal i guess.I almost never got to see this AMAZING movie.Good thing i stayed up for it instead of
going to bed..well earlier than usual.lol.I hope you all enjoy the hell of this movie and Love this movie just as much
as i did.I wish i could type this all in caps but its again the rules i guess thats shouting but it would really show my
excitement for the film.I Give It Three Thumbs Way Up!<br /><br />This Movie Blew ME AWAY!

Let's define a function (clean_review) that would clean each movie review (sentence)

import re  # regular expressions, used to remove the non-alphabetic characters
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

english_stopwords = stopwords.words('english')  # get the English stopwords

stemmer = PorterStemmer()  # create an instance of the PorterStemmer class

# define cleaning function
def clean_review(text):
    # convert text to lower case
    text = text.lower()  # text is a single review

    # remove non-alphabetic characters
    # sub = substitute: any character that matches the pattern [^a-z]
    # (i.e. anything that is not a lowercase letter) is replaced by a space character
    text = re.sub(r'[^a-z]', ' ', text)

    # tokenize the sentence (before stemming the text, we convert the sentence into an array of words)
    words = word_tokenize(text)

    # stem words using the Porter Stemmer algorithm
    stemmed = [stemmer.stem(word) for word in words]
    # stemmed is an array containing the stemmed version of the words array

    # reconstruct the text: rebuild the array of stemmed words into a full sentence
    text = ' '.join(stemmed)

    # remove stopwords:
    # loop over each word in the text, check whether it is a stopword, and keep it only if it is NOT,
    # then recombine the remaining words into the full text
    text = ' '.join([word for word in text.split() if word not in english_stopwords])
    # calling text.split() is safe here because the stemmed words were joined with a single space character,
    # so splitting on whitespace recovers exactly the same words, nothing extra and nothing missing

    # return the full cleaned text (a string, not an array)
    return text

And try it out on an instance of the dataset

print(df['review'][1])
print(clean_review(df['review'][1]))

# compare the ouput word by word

Actor turned director Bill Paxton follows up his promising debut, the
Gothic-horror "Frailty", with this family friendly sports drama about the
1913 U.S. Open where a young American caddy rises from his humble
background to play against his Bristish idol in what was dubbed as "The
Greatest Game Ever Played." I'm no fan of golf, and these scrappy underdog
sports flicks are a dime a dozen (most recently done to grand effect with
"Miracle" and "Cinderella Man"), but some how this film was enthralling
all the same.<br /><br />The film starts with some creative opening
credits (imagine a Disneyfied version of the animated opening credits of
HBO's "Carnivale" and "Rome"), but lumbers along slowly for its first by-
the-numbers hour. Once the action moves to the U.S. Open things pick up
very well. Paxton does a nice job and shows a knack for effective
directorial flourishes (I loved the rain-soaked montage of the action on
day two of the open) that propel the plot further or add some unexpected
psychological depth to the proceedings. There's some compelling character
development when the British Harry Vardon is haunted by images of the
aristocrats in black suits and top hats who destroyed his family cottage
as a child to make way for a golf course. He also does a good job of
visually depicting what goes on in the players' heads under pressure.
Golf, a painfully boring sport, is brought vividly alive here. Credit
should also be given the set designers and costume department for creating
an engaging period-piece atmosphere of London and Boston at the beginning
of the twentieth century.<br /><br />You know how this is going to end not
only because it's based on a true story but also because films in this
genre follow the same template over and over, but Paxton puts on a better
than average show and perhaps indicates more talent behind the camera than
he ever had in front of it. Despite the formulaic nature, this is a nice
and easy film to root for that deserves to find an audience.

actor turn director bill paxton follow hi promis debut gothic horror
frailti thi famili friendli sport drama u open young american caddi rise
hi humbl background play hi bristish idol wa dub greatest game ever play


fan golf scrappi underdog sport flick dime dozen recent done grand effect
miracl cinderella man thi film wa enthral br br film start creativ open
credit imagin disneyfi version anim open credit hbo carnival rome lumber
along slowli first number hour onc action move u open thing pick veri well
paxton doe nice job show knack effect directori flourish love rain soak
montag action day two open propel plot add unexpect psycholog depth
proceed compel charact develop british harri vardon haunt imag aristocrat
black suit top hat destroy hi famili cottag child make way golf cours also
doe good job visual depict goe player head pressur golf pain bore sport
brought vividli aliv credit also given set design costum depart creat
engag period piec atmospher london boston begin twentieth centuri br br
know thi go end onli becaus base true stori also becaus film thi genr
follow templat paxton put better averag show perhap indic talent behind
camera ever front despit formula natur thi nice easi film root deserv find
audienc

And now clean the entire dataset reviews

# apply to the whole dataset
df['clean_review'] = df['review'].apply(clean_review)  # for each item (review), apply clean_review
df.head()

Split dataset for training and testing


We will split our data into two subsets: a 50% subset will be used for training the model
for prediction and the remaining 50% will be used for evaluating or testing its
performance. The random state ensures reproducibility of the results.

from sklearn.model_selection import train_test_split


X = df['clean_review'].values


y = df['sentiment'].values

# Split data into 50% training & 50% test
# Use a random state of 42, for example, to ensure we always get the same split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# (25000,) (25000,): we have 25000 rows for the training input; each input is a single item because it is one string of words (one review)
# the second shape is 25000 rows of one integer each, which is the sentiment label

Feature extraction with Bag of Words


In this section, let's apply the Bag of Words method to learn the vocabulary of our
text and with it transform our training input data

# here we want to convert our reviews from being an array of words into an array of numbers

from sklearn.feature_extraction.text import CountVectorizer

# FILL BLANKS
# define a CountVectorizer (with binary=True and max_features=10000)
vectorizer = CountVectorizer(binary=True, max_features=10000)

# we want to use CountVectorizer not as a count vectorizer but rather as a basic bag of words,
# meaning that we don't care about the count of the words (how many times a word appeared in the sentence);
# if the word appeared at least once we just put 1, otherwise we put 0: so we set binary=True
# max_features specifies the maximum number of features because we are dealing with a huge dataset,
# so we don't care about having all the unique words in the vocab, only the top 10000 words
# so the vocab size is 10000, which means each sentence (review) will be converted into a vector of size 10000

# learn the vocabulary of all tokens in our training dataset
vectorizer.fit(x_train)


# transform x_train to bag of words
x_train_bow = vectorizer.transform(x_train)  # convert the reviews into actual vectors of numbers
x_test_bow = vectorizer.transform(x_test)

print(x_train_bow.shape, y_train.shape)
print(x_test_bow.shape, y_test.shape)

# (25000, 10000): each review is now a vector that has 10000 numbers in it

(25000, 10000) (25000,)
(25000, 10000) (25000,)

Classification
Our data is ready for classification. Let's use LogisticRegression

from sklearn.linear_model import LogisticRegression

# define the LogisticRegression classifier


model = LogisticRegression()

# train the classifier on the training data


model.fit(x_train_bow, y_train)

# get the mean accuracy on the training data


acc_train = model.score(x_train_bow, y_train)

print('Training Accuracy:', acc_train)

Training Accuracy: 0.9812


/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to
converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
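The convergence warning above is harmless for this exercise, but if desired it can be addressed by giving the lbfgs solver more iterations, e.g. (max_iter=1000 is an arbitrary choice, not part of the original notebook):

from sklearn.linear_model import LogisticRegression

# allow the solver more iterations so it can fully converge on the 10000-dimensional BoW features
model = LogisticRegression(max_iter=1000)
model.fit(x_train_bow, y_train)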


Evaluating the performance of our model through its accuracy score

# Evaluate model with test data


acc_test = model.score(x_test_bow, y_test)
print('Test Accuracy:', acc_test)

Test Accuracy: 0.86624

Let's use the model to predict!


To do so, let's create a predict function which takes as argument our model and the
bag of words vectorizer together with a review on which it would predict the
sentiment.
This review should be cleaned with the clean_review function we built, transformed
by bag of words and then used for prediction with model.predict().

# define predict function
def predict(model, vectorizer, review):
    review = clean_review(review)  # cleaning
    review_bow = vectorizer.transform([review])  # convert it from text to numbers
    return model.predict(review_bow)[0]  # return the first prediction

# we have to make sure that our input review passes through the same steps,
# i.e. it is cleaned and preprocessed the same way, before it can be fed to the model
# review = the actual review to predict on

And let's try it out on an example

review = 'The movie was great!'


predict(model, vectorizer, review)


The Word Embedding Model

There is a better way to encode text for neural networks that addresses the limitations of the BoW approach.

 A learned representation for text where words that have the same meaning have
a similar representation.
 Considered one of the breakthroughs of deep learning on NLP problems, because
it was among the first approaches to take the meaning of words into consideration.


Word Embeddings - How to use them?


 Compute similarities between words
 Create groups of related words
 Use the features in text classification
 Enrich information about lesser-known words by leveraging more frequently used
words:

- Part-of-Speech tagging
- Named-Entity recognition


Word2Vec
 One of the earliest and most famous word-embedding models, which converts
words into vectors.
 It is an unsupervised learning approach: the model is given a text corpus or a
series of documents and learns an embedding for the words in this corpus.
 A statistical method for efficiently learning a standalone word embedding from a
text corpus.
 Developed by Tomas Mikolov, et al. at Google in 2013.
 Has become the de facto standard for developing pre-trained word embeddings.
 The cool property of word2vec is that similar words have similar vectors, so we
can draw inferences about different words based on vector similarity.


How does Word2vec work?

 Uses small neural networks to calculate word embeddings based on words’ local
context (window of neighboring words). There are 2 different approaches for
word2vec models:

- Continuous Bag-of-Words (CBOW)


- Continuous Skip-Gram model

CBOW works by taking a window of 5 words, hiding the word in the middle, and then
teaching a model to predict this word based on the 4 neighboring words. The CBOW
model learns the embeddings by predicting the current word based on its context

The continuous skip-Gram model works by taking a word and learning how to predict
the four neighboring words based on that word. The continuous skip-Gram model learns
by predicting the surrounding words based on the current word.
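In Gensim (used later in these notes), the choice between the two architectures is controlled by the sg parameter; a minimal sketch, reusing the kind of tokenized dummy sentences shown in the training example below:

from gensim.models import Word2Vec

sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence']]

cbow_model = Word2Vec(sentences, min_count=1, sg=0)      # sg=0 (default): Continuous Bag-of-Words
skipgram_model = Word2Vec(sentences, min_count=1, sg=1)  # sg=1: Continuous Skip-Gram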


Global Vectors for Word Representation (GloVe)
 An extension of the Word2Vec method for learning word vectors.
 Developed by Pennington, et al. at Stanford.
 Problem? Word2Vec ignores statistical properties of the corpus (whether some
context words appear more often than others, i.e. the counts or frequencies of
word appearances).
 Word2vec builds its model purely based on where words appear in sentences. For
that reason, GloVe was developed to bring co-occurrence frequencies into the
model, since they hold vital information and should be integrated into it.
Co-occurrence means considering all the unique words of the corpus and counting,
for each pair of unique words, how often they appear next to each other, which
carries extra information about meaning.
 GloVe argues that the frequency of co-occurrences is vital information.
 GloVe builds word embeddings in such a way that a combination of word vectors
relates directly to the probability of these words' co-occurrence in the corpus.
 In a practical sense, it resembles building a table whose columns are all the unique
words and whose rows are also all the unique words, and for each pair counting
their appearances next to each other. This expands the ability of the model to
separate words from each other and their meanings.

Gensim
 Open-source Python library for natural language processing.
 It is billed as “topic modeling for humans”.
 It is not a generic NLP framework like NLTK, but rather a mature, focused, and
efficient suite of NLP tools for topic modeling.
 Topic modeling involves extracting the topic or content of a certain text. It asks
the question: what are we talking about here?
 It supports an implementation of the Word2Vec model, so it is useful when we want to
train a custom word embedding model or load a pre-trained model for inference, as
in the sketch below.
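For example, a pre-trained embedding can be fetched through Gensim's downloader API; a sketch assuming internet access ('glove-wiki-gigaword-100' is one of the models the downloader provides):

import gensim.downloader as api

# downloads the pre-trained 100-dimensional GloVe vectors (Wikipedia + Gigaword) on first use
glove = api.load('glove-wiki-gigaword-100')

print(glove.most_similar('einstein', topn=3))  # words whose vectors are closest to 'einstein'
print(glove.similarity('moon', 'earth'))       # cosine similarity between two word vectors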


Word Embeddings
Objective: The goal of this exercise is to explore the Word2Vec technique for word
embeddings and to introduce Stanford's GloVe embedding as well. The libraries we will
be using are Gensim for the Word2Vec and GloVe word embeddings, matplotlib for
visualization, and Scikit-Learn for Principal Component Analysis models, which
are used for reducing dimensionality.

Learn Word2Vec Embedding using Gensim


Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. However, we
will demonstrate the principles using a small in-memory example of text.
Each sentence must be tokenized (divided into words and prepared). The sentences
could be text loaded into memory, or an iterator that progressively loads text, required
for very large text corpora.
There are many parameters on this constructor:

 size: (default 100) The number of dimensions of the embedding, e.g. the length of the
dense vector to represent each token (word).
 window: (default 5) The maximum distance between a target word and words around the
target word.
 min_count: (default 5) The minimum count of words to consider when training the
model; words with an occurrence less than this count will be ignored.
 workers: (default 3) The number of threads to use while training.
 sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).

Building and training a Word2Vec model


from gensim.models import Word2Vec

# define training data (this is dummy training data)
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# Here the data is already split into tokens,
# so it is an array of arrays, and each inner array holds the words of one sentence

# We want to keep every word that occurs at least one time (min_count=1)

# train model
model = Word2Vec(sentences, min_count=1)

# summarize the loaded model
print(model)

# summarize vocabulary
words = list(model.wv.vocab)  # model.wv.vocab is the vocab of the model; we convert it to a list
print(words)

# access the vector for one word
print(model['sentence'])  # convert this word into its vector form

# save model
model.save('model.bin')

# size=100: each vector has a size of 100

# ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']:
# these are the unique words that we have in our vocab

Results:

WARNING:gensim.models.base_any2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[-7.8457932e-04 3.5207625e-03 -4.2921999e-03 -4.3479460e-03
-4.9049719e-03 -4.2559318e-03 -9.3232520e-04 -3.2275093e-03
5.5988797e-04 -4.5542549e-03 3.4206519e-03 1.3965422e-03
-3.0884927e-03 3.7948424e-03 1.6137204e-03 -4.8686615e-03
1.8643612e-03 -3.2971515e-03 -4.9116723e-03 -3.9235079e-03
-3.7943881e-03 -1.8185180e-03 -3.2840617e-05 4.1652853e-03
-2.0585104e-04 3.0013716e-03 -4.8512327e-03 -5.7977054e-04
3.7493408e-03 -6.9524965e-04 5.2066760e-05 4.2560236e-03
-3.0693677e-03 4.4368575e-03 4.2182300e-03 -1.3236658e-03
-3.0931591e-03 8.3119166e-04 -3.0872936e-03 -1.5930691e-03
-1.5032124e-03 1.9341486e-03 1.0838168e-03 4.0697749e-03
2.8024155e-03 1.7420420e-03 -2.8390333e-03 3.5742642e-03
-2.3106801e-04 2.6949684e-03 -3.6318353e-03 -2.8590560e-03
4.0416783e-03 1.3401215e-03 2.1875310e-03 5.5839861e-04


3.3788255e-04 4.7156317e-03 2.6413812e-03 -4.4250677e-04


4.1823206e-04 3.6568106e-03 -1.5885477e-03 1.1498865e-03
1.2212624e-03 -3.6188874e-03 1.3244674e-03 2.0759690e-03
-3.3118036e-03 4.2094686e-03 3.4420323e-03 -2.4229037e-03
-2.6206372e-03 2.2858661e-03 9.1098453e-04 8.4567037e-05
5.7552656e-04 -6.4006272e-05 -2.7839965e-03 9.6004776e-04
1.5203831e-03 3.9642006e-03 1.4925852e-03 3.4692041e-03
6.2210555e-04 1.1770288e-03 -4.4826525e-03 -7.5982476e-04
-1.2346446e-03 -3.8285055e-03 -2.2624165e-03 3.1580434e-03
-4.1306936e-03 -4.6889810e-03 -2.6903450e-03 4.7526191e-04
-3.4166775e-03 2.9652417e-03 -4.9173511e-03 -1.1885230e-03]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:25: DeprecationWarning: Call to deprecated
`__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
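The deprecation warnings above come from the older access patterns (model['word'] and model.wv.vocab). As a hedged aside not in the original notes, the Gensim 4.x equivalents look roughly like this:

# Gensim 4.x equivalents (sketch; `model` is the Word2Vec model trained above)
words = list(model.wv.key_to_index)   # the vocabulary as a list of words
vector = model.wv['sentence']         # vector lookup always goes through model.wv
print(len(words), vector.shape)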

# let's load the binary model and test it

# load model
new_model = Word2Vec.load('model.bin')
print(new_model['this', 'is'])
[[-2.6341430e-03 -6.8210840e-04 -2.2032256e-03 -2.9977076e-04
4.1053989e-03 -1.6542217e-03 -4.9963145e-04 1.7543014e-03
-1.2514311e-04 -4.7425432e-03 1.7437695e-03 2.6504786e-03
-9.4006537e-04 -1.4253150e-03 3.6435311e-03 -2.3904638e-03
-4.4562640e-03 -4.9993531e-03 -2.2515173e-03 2.3276228e-03
-3.3021998e-03 4.6024695e-03 3.0749412e-03 -4.2429226e-03
-1.5931168e-03 3.8181676e-03 2.4235758e-03 -3.4062283e-03
3.3555010e-03 1.2844945e-03 -2.7234193e-03 -8.0988108e-04
1.6813108e-03 -9.1353070e-04 4.4700159e-03 1.0303746e-03
2.9608165e-03 4.2942618e-03 -3.7393915e-03 -3.5502152e-03
-4.6064076e-03 -4.6758484e-03 4.1863527e-03 3.1615091e-03
4.5192647e-03 -9.1852510e-04 2.6027197e-03 -4.1375658e-03
-3.5489022e-03 -2.2965863e-03 7.5722620e-04 6.1148533e-04
2.1013829e-03 -3.9978596e-04 9.7986590e-04 4.9094297e-03
-2.1577110e-03 4.8989342e-03 3.8428246e-03 2.7198202e-03
-1.4638512e-03 -1.8885430e-03 -2.5638286e-04 -2.0531581e-03
-1.6875562e-03 -2.7395764e-03 5.5740983e-04 2.5572355e-03
5.6512235e-04 -2.9321280e-03 -7.5157406e-04 -1.2344553e-04
1.1710543e-03 -1.1974664e-03 -1.2816156e-03 -4.1503790e-03
4.5915483e-03 -9.5228269e-04 -3.4181602e-04 -2.3537560e-03
-2.6966461e-03 7.6452579e-04 -1.2867393e-03 1.2928311e-03
-3.9543938e-03 -4.6212925e-03 4.1194628e-03 4.6026148e-03
4.3687961e-04 -2.1079446e-03 4.3081054e-03 4.9309372e-03
-2.4515828e-03 -1.7825521e-04 2.4474061e-03 -2.4826333e-03
-4.6479427e-03 4.2938837e-03 -1.1659318e-04 -1.8699563e-03]
[-5.1402091e-04 4.0105209e-03 -6.2171964e-04 2.5609015e-03
-1.8050724e-03 4.9377461e-03 3.2770697e-03 1.8770240e-03
-4.9929242e-03 3.2758827e-03 -1.6969994e-03 -2.6876389e-03
2.9338847e-03 -4.3162694e-03 -4.2533325e-03 4.6241065e-03
-1.0224945e-03 2.1736498e-03 4.4645616e-03 -1.1554005e-03
-3.8527905e-03 -6.9526068e-05 2.0650621e-04 4.6414724e-03
1.1177687e-03 7.0009363e-04 3.9822836e-03 7.9727673e-05
-2.5550727e-04 1.5229684e-03 -3.3755524e-03 4.8815636e-03


-4.6662665e-03 4.2167231e-03 -3.0159575e-03 -3.6175633e-03


4.4732318e-05 -1.4307767e-03 8.6314336e-04 -2.6571582e-04
2.5633357e-03 -3.5801379e-04 -1.2866167e-03 2.1925850e-03
-4.2773047e-03 -4.9695778e-03 -1.3883609e-03 3.1987922e-03
1.7983645e-03 4.0425011e-03 4.5442330e-03 -1.6893109e-03
-4.9699568e-03 4.2594168e-03 -1.2876982e-03 3.6424017e-03
2.7398604e-03 4.0277229e-03 -2.7221404e-03 4.6593808e-03
2.5286388e-03 -7.1398198e-04 3.0460334e-03 3.1234655e-03
-4.6948571e-04 -4.9914969e-03 2.2993404e-03 3.1571195e-03
1.5602089e-03 -3.4938734e-03 1.7282860e-03 7.2801049e-04
-3.7645237e-03 -1.3775865e-03 5.9858215e-04 8.6858316e-05
2.0116186e-03 3.1379755e-03 -4.1431260e-05 2.1408075e-03
2.7511080e-03 -3.4430167e-03 -1.6758190e-03 -4.2654286e-04
-3.7440318e-03 2.3888228e-03 -3.1758812e-03 1.1063067e-03
-3.5542385e-03 4.4913641e-03 -2.4453797e-03 -4.1569071e-03
4.7013820e-03 4.1635367e-03 1.6596923e-03 3.6739753e-04
-4.3959259e-03 -1.1767038e-03 -4.7597084e-03 3.0412525e-03]]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:5: DeprecationWarning: Call to deprecated
`__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
"""

Visualize Word Embedding


After learning the word embeddings for the text, it is nice to explore them with visualization.
We can use classical projection methods such as PCA to reduce the high-dimensional word
vectors to two dimensions and plot them on a graph. The visualization can
provide a qualitative diagnostic for our learned model.

from gensim.models import Word2Vec


from sklearn.decomposition import PCA
from matplotlib import pyplot

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

# fit a 2D PCA model to the vectors
X = model[model.wv.vocab]  # gathers the vectors for all the words in our vocabulary
pca = PCA(n_components=2) #reduce dimensionality to 2D
result = pca.fit_transform(X) #2D model to plot


# create a scatter plot of the projection

# pull out the 2 dimensions as x and y
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)  # visualize the dots with the words

# annotate the points on the graph with the words themselves
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))  # place each word at its 2D coordinates
pyplot.show()

Results:

WARNING:gensim.models.base_any2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:16: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  app.launch_new_instance()


Google Word2Vec
Instead of training your own word vectors (which requires a lot of RAM and compute
power), you can simply use a pre-trained word embedding. Google has published a pre-
trained Word2Vec model that was trained on Google news data (about 100 billion
words). It contains 3 million words and phrases and was fit using 300-dimensional
word vectors. It is a 1.53 Gigabyte file.

import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# We now have the GoogleNews-vectors-negative300.bin binary file,
# which contains the weights of the model

[==================================================] 100.0%
1662.8/1662.8MB downloaded

# Alternative: load the Google word2vec model directly from the downloaded binary file
# from gensim.models import KeyedVectors
# filename = 'GoogleNews-vectors-negative300.bin'
# model = KeyedVectors.load_word2vec_format(filename, binary=True)


Let's have fun


# get word vector
model['car']

array([ 0.13085938, 0.00842285, 0.03344727, -0.05883789, 0.04003906, -


0.14257812, 0.04931641, -0.16894531, 0.20898438, 0.11962891, 0.18066406, -
0.25 , -0.10400391, -0.10742188, -0.01879883, 0.05200195, -0.00216675,
0.06445312, 0.14453125, -0.04541016, 0.16113281, -0.01611328, -0.03088379,
0.08447266, 0.16210938, 0.04467773, -0.15527344, 0.25390625, 0.33984375,
0.00756836, -0.25585938, -0.01733398, -0.03295898, 0.16308594, -
0.12597656, -0.09912109, 0.16503906, 0.06884766, -0.18945312, 0.02832031,
-0.0534668 , -0.03063965, 0.11083984, 0.24121094, -0.234375 , 0.12353516,
-0.00294495, 0.1484375 , 0.33203125, 0.05249023, -0.20019531, 0.37695312,
0.12255859, 0.11425781, -0.17675781, 0.10009766, 0.0030365 , 0.26757812,
0.20117188, 0.03710938, 0.11083984, -0.09814453, -0.3125 , 0.03515625,
0.02832031, 0.26171875, -0.08642578, -0.02258301, -0.05834961, -
0.00787354, 0.11767578, -0.04296875, -0.17285156, 0.04394531, -0.23046875,
0.1640625 , -0.11474609, -0.06030273, 0.01196289, -0.24707031, 0.32617188,
-0.04492188, -0.11425781, 0.22851562, -0.01647949, -0.15039062, -
0.13183594, 0.12597656, -0.17480469, 0.02209473, -0.1015625 , 0.00817871,
0.10791016, -0.24609375, -0.109375 , -0.09375 , -0.01623535, -0.20214844,
0.23144531, -0.05444336, -0.05541992, -0.20898438, 0.26757812, 0.27929688,
0.17089844, -0.17578125, -0.02770996, -0.20410156, 0.02392578, 0.03125 , -
0.25390625, -0.125 , -0.05493164, -0.17382812, 0.28515625, -0.23242188,
0.0234375 , -0.20117188, -0.13476562, 0.26367188, 0.00769043, 0.20507812,
-0.01708984, -0.12988281, 0.04711914, 0.22070312, 0.02099609, -0.29101562,
-0.02893066, 0.17285156, 0.04272461, -0.19824219, -0.04003906, -
0.16992188, 0.10058594, -0.09326172, 0.15820312, -0.16503906, -0.06054688,
0.19433594, -0.07080078, -0.06884766, -0.09619141, -0.07226562,
0.04882812, 0.07324219, 0.11035156, 0.04858398, -0.17675781, -0.33789062,
0.22558594, 0.16308594, 0.05102539, -0.08251953, 0.07958984, 0.08740234, -
0.16894531, -0.02160645, -0.19238281, 0.03857422, -0.05102539, 0.21972656,
0.08007812, -0.21191406, -0.07519531, -0.15039062, 0.3046875 , -
0.17089844, 0.12353516, -0.234375 , -0.10742188, -0.06787109, 0.01904297,
-0.14160156, -0.22753906, -0.16308594, 0.14453125, -0.15136719, -0.296875
, 0.22363281, -0.10205078, -0.0456543 , -0.21679688, -0.09033203, 0.09375
, -0.15332031, -0.01550293, 0.3046875 , -0.23730469, 0.08935547,
0.03710938, 0.02941895, -0.28515625, 0.15820312, -0.00306702, 0.06054688,
0.00497437, -0.15234375, -0.00836182, 0.02197266, -0.12109375, -
0.13867188, -0.2734375 , -0.06835938, 0.08251953, -0.26367188, -
0.16992188, 0.14746094, 0.08496094, 0.02075195, 0.13671875, -0.04931641, -
0.0100708 , -0.00369263, -0.10839844, 0.14746094, -0.15527344, 0.16113281,
0.05615234, -0.05004883, -0.1640625 , -0.26953125, 0.4140625 , 0.06079102,
-0.046875 , -0.02514648, 0.10595703, 0.1328125 , -0.16699219, -0.04907227,
0.04663086, 0.05151367, -0.07958984, -0.16503906, -0.29882812, 0.06054688,
-0.15332031, -0.00598145, 0.06640625, -0.04516602, 0.24316406, -
0.07080078, -0.36914062, -0.23144531, -0.11914062, -0.08300781,
0.14746094, -0.05761719, 0.23535156, -0.12304688, 0.14648438, 0.13671875,
0.15429688, 0.02111816, -0.09570312, 0.05859375, 0.03979492, -0.08105469,
0.0559082 , -0.16601562, 0.27148438, -0.20117188, -0.00915527, 0.07324219,


0.10449219, 0.34570312, -0.26367188, 0.02099609, -0.40039062, -0.03417969,


-0.15917969, -0.08789062, 0.08203125, 0.23339844, 0.0213623 , -0.11328125,
0.05249023, -0.10449219, -0.02380371, -0.08349609, -0.04003906,
0.01916504, -0.01226807, -0.18261719, -0.06787109, -0.08496094, -
0.03039551, -0.05395508, 0.04248047, 0.12792969, -0.27539062, 0.28515625,
-0.04736328, 0.06494141, -0.11230469, -0.02575684, -0.04125977,
0.22851562, -0.14941406, -0.15039062], dtype=float32)
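Each word is now a 300-dimensional vector, and similarity between two words is the cosine of the angle between their vectors. Here is a small sketch (not from the original notes; the word 'truck' is an illustrative choice) that computes this by hand and then with Gensim's built-in helper.

import numpy as np

car = model['car']
truck = model['truck']

# cosine similarity computed by hand
print(np.dot(car, truck) / (np.linalg.norm(car) * np.linalg.norm(truck)))

# the same value via Gensim's helper
print(model.similarity('car', 'truck'))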

# get most similar words


model.most_similar('yellow')

[('red', 0.751919150352478), ('bright_yellow', 0.6869138479232788),


('orange', 0.6421886682510376), ('blue', 0.6376121044158936), ('purple',
0.6272757053375244), ('yellows', 0.612633228302002), ('pink',
0.6098285913467407), ('bright_orange', 0.5974606871604919),
('Warplanes_streaked_overhead', 0.583052396774292), ('participant_LOGIN',
0.5816755294799805)]

# queen = (king - man) + woman


result = model.most_similar(positive=['woman', 'king'], negative=['man'],
topn=1)
print(result)

[('queen', 0.7118192911148071)]
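Under the hood, most_similar with positive and negative word lists performs exactly this vector arithmetic and then ranks every vocabulary word by cosine similarity to the resulting vector. A rough sketch (not from the original notes) of doing it by hand with Gensim's similar_by_vector helper:

# king - man + woman, computed explicitly on the raw vectors
vec = model['king'] - model['man'] + model['woman']

# rank vocabulary words by cosine similarity to the composed vector;
# note that the input words themselves often dominate the top results
print(model.similar_by_vector(vec, topn=5))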

# (paris - france) + spain = ?

result = model.most_similar(positive=["paris", "spain"], negative=["france"], topn=1)
print(result)

[('madrid', 0.5295541286468506)]

model.doesnt_match(["red", "blue", "car", "orange"])

# we give the model a list of words and it returns the one that doesn't match (the odd one out)
# here 'car' has the lowest average similarity to the other words, so it is returned

/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py:895:
FutureWarning: arrays to stack must be passed as a "sequence" type such as
list or tuple. Support for non-sequence iterables such as generators is
deprecated as of NumPy 1.16 and will raise an error in the future.


  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)

'car'
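doesnt_match works roughly by averaging the (length-normalized) vectors of the given words and returning the word least similar to that average. A hedged sketch of the idea (not the library's exact implementation):

import numpy as np

words = ["red", "blue", "car", "orange"]
vecs = np.array([model[w] / np.linalg.norm(model[w]) for w in words])  # unit vectors
mean = vecs.mean(axis=0)

# cosine similarity of each word to the mean; the lowest one is the odd word out
sims = vecs @ (mean / np.linalg.norm(mean))
print(words[int(np.argmin(sims))])  # expected: 'car'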

Stanford’s GloVe Embedding


Like Word2Vec, the GloVe researchers also provide pre-trained word vectors. Let's
download the smallest GloVe pre-trained model from the GloVe website. It is an 822-
megabyte zip file with 4 different models (50, 100, 200 and 300-dimensional vectors)
trained on Wikipedia data with 6 billion tokens and a 400,000-word vocabulary.

# download
!wget http://nlp.stanford.edu/data/glove.6B.zip

# unzip downloaded word embeddings


!unzip glove.6B.zip

# list files in current directory


!ls -lah

--2022-09-16 13:42:06-- http://nlp.stanford.edu/data/glove.6B.zip


Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80...
connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-09-16 13:42:06-- https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443...
connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
[following]
--2022-09-16 13:42:06--
https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)...
171.64.64.22
Connecting to downloads.cs.stanford.edu
(downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6B.zip 100%[===================>] 822.24M 5.01MB/s in 2m 39s


2022-09-16 13:44:46 (5.16 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive: glove.6B.zip
inflating: glove.6B.50d.txt
inflating: glove.6B.100d.txt
inflating: glove.6B.200d.txt
inflating: glove.6B.300d.txt
total 2.9G
drwxr-xr-x 1 root root 4.0K Sep 16 13:45 .
drwxr-xr-x 1 root root 4.0K Sep 16 12:50 ..
drwxr-xr-x 4 root root 4.0K Sep 14 13:43 .config
-rw-rw-r-- 1 root root 332M Aug 4 2014 glove.6B.100d.txt
-rw-rw-r-- 1 root root 662M Aug 4 2014 glove.6B.200d.txt
-rw-rw-r-- 1 root root 990M Aug 27 2014 glove.6B.300d.txt
-rw-rw-r-- 1 root root 164M Aug 4 2014 glove.6B.50d.txt
-rw-r--r-- 1 root root 823M Oct 25 2015 glove.6B.zip
-rw-r--r-- 1 root root 22K Sep 16 13:03 model.bin

drwxr-xr-x 1 root root 4.0K Sep 14 13:44 sample_data

from gensim.models import KeyedVectors


from gensim.scripts.glove2word2vec import glove2word2vec

# convert the 100-dimensional version of the GloVe model to word2vec format
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

# load the converted model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# binary=False because the file is not binary; it is just a converted text file
# Here the model named 'model' is loaded into memory
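As a hedged aside not in the original notes: in Gensim 4.x the glove2word2vec script is deprecated, and the GloVe text file can be loaded directly, roughly like this:

from gensim.models import KeyedVectors

# Gensim 4.x: load the raw GloVe file without converting it first
# (no_header=True tells the loader the file has no word2vec header line)
model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)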

# calculate: (king - man) + woman = ?


result = model.most_similar(positive=['woman', 'king'], negative=['man'],
topn=1)
print(result)

[('queen', 0.7698541283607483)]
