Introduction To NLP
1. Lexical Ambiguity – a word has multiple meanings (e.g. bass: a fish, or bass: a musical instrument)
2. Syntactic Ambiguity – a sentence has multiple parse trees, and therefore multiple possible meanings
3. Semantic Ambiguity – the meaning of the whole sentence depends on the meaning of an ambiguous word (see the "I saw her duck" example below)
4. Anaphoric Ambiguity – a phrase or word refers back to something previously mentioned, and its meaning depends on what it refers to
Parsing is the process of breaking down a sentence into its elements so that the sentence can be
understood. Traditional parsing is done by hand, sometimes using sentence diagrams. Parsing is also
involved in more complex forms of analysis such as discourse analysis and psycholinguistics.
Semantic ambiguity: "I saw her duck" has two meanings, depending on the meaning of the word duck.
If duck is the verb, the sentence means the girl lowered her head.
If duck refers to the animal, it means I saw her pet duck.
So the meaning of the whole sentence depends on the meaning of the word duck.
Applications of NLP
Part-of-Speech (POS) tagging: a way to let the computer understand text by assigning a tag to each
word in the sentence: verb, noun, adjective, etc.
The sentence is broken down into words and each word is assigned a tag.
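A minimal sketch of POS tagging with NLTK (the library used later in these notes; the sentence and the tagger choice are just an illustration):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Albert Einstein developed the theory of relativity")
print(nltk.pos_tag(tokens))  # e.g. [('Albert', 'NNP'), ('Einstein', 'NNP'), ('developed', 'VBD'), ...]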
Another application of NLP is Named Entity Recognition (NER): extracting specific predefined
entities from text, such as places, names, dates, locations, etc.
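As a hedged illustration, NLTK also ships a simple named-entity chunker (the exact tool is an assumption; the notes don't prescribe one):

import nltk
for pkg in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(pkg)

sentence = "Albert Einstein was a professor at the University of Berlin"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)  # entities such as PERSON and ORGANIZATION appear as labelled subtrees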
Text translation
Visual Question Answering: combines CV and NLP. We train an AI model to look at an image and then answer
a question that we ask. The model has to understand the content of the image and then understand the
question asked before it can answer.
Image captioning: also combines CV and NLP. You show the model an image and the model has to describe
the image in words, i.e. generate a sentence that describes the image.
Automatic Handwriting Generation: train a model to generate handwritten text that looks realistic.
Data Preparation:
1. Text cleaning
Remove punctuation and special characters (@!.,><-+#$%^) as these don't hold any extra information
Remove numbers and emojis, or replace them with something more meaningful
Remove leading and trailing whitespace: it takes up space and doesn't carry any meaningful information
Remove stopwords (of, at, by, for, with, etc.): they don't hold information and are frequently repeated
Normalize case (Apple vs apple): transform everything to lowercase
Remove HTML/XML tags: needed if the data was collected by web scraping
Replace accented characters (such as é)
Correct spelling errors: multiple similar spellings of the same word can confuse the model
2. Tokenization
Turning a string or document into tokens (smaller chunks of text).
Important Step in preparing text for NLP
Different approaches:
Split by Whitespace
Split by word
Custom regex (regular expression)
Text Cleaning
This colab will show you how to clean your data in preparation for NLP. In the
first section we will be using built-in Python functions; in the second we will
introduce the NLTK library.
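Split by whitespace
The token list below was produced by splitting the raw text on whitespace; a minimal sketch of that step (the file name is an assumption):

# load the raw text (file name assumed for illustration)
with open('einstein.txt', 'r') as f:
    text = f.read()

# split on whitespace: punctuation stays attached to the words (e.g. 'lived.', 'crazy,')
words = text.split()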
print(words[:100])
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who’s', 'ever', 'lived.',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy,', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world.', 'Nonetheless,', 'when', 'he', 'applied', 'his',
'theory', 'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', '1917,', 'while',
'serving', 'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University',
'of', 'Berlin,', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', '"cosmological', 'constant".', 'He', 'discarded', 'this', 'notion',
'when', 'it', 'had', 'been', 'established', 'that', 'the', 'universe']
Split by word
Using regular expression re and splitting based on words. Notice the difference in
"who's".
print(words[:100])
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', 'Nonetheless', 'when', 'he', 'applied', 'his', 'theory',
'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', '1917', 'while', 'serving',
'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University', 'of',
'Berlin', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', 'cosmological', 'constant', 'He', 'discarded', 'this', 'notion', 'when',
'it', 'had', 'been', 'established', 'that', 'the']
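A minimal sketch of the split that produced the list above (splitting on runs of non-word characters, which drops punctuation and breaks "who's" into 'who' and 's'):

import re

# split on any sequence of non-word characters and drop empty strings
words = [w for w in re.split(r'\W+', text) if w]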
Normalizing case
Normalizing is when we turn all the words of the document into lower case. Be careful,
however: this method should not always be applied, because it can change the meaning entirely.
For example, take the French telecom company Orange and the fruit orange: normalizing
the case collapses the distinction.
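A one-line sketch of this step on a token list:

# lowercase every token (keeping in mind the Orange vs orange caveat above)
words = [word.lower() for word in words]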
NLTK
The Natural Language Toolkit is a suite of libraries and programs for symbolic
and statistical natural language processing for English, written in the Python
programming language. (Wikipedia)
import nltk
from nltk import sent_tokenize  # function responsible for tokenizing (splitting) our text into sentences
nltk.download('punkt')  # the punkt tokenizer is the algorithm responsible for tokenizing the text into sentences
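The token list below, with punctuation kept as separate tokens (and quotes rewritten as `` and ''), comes from NLTK's word tokenizer; a minimal sketch:

from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)  # punkt-based word tokenization
print(tokens[:100])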
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', '’', 's', 'ever',
'lived', '.', 'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', ',',
'but', 'that', 'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', '.', 'Nonetheless', ',', 'when', 'he',
'applied', 'his', 'theory', 'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in',
'1917', ',', 'while', 'serving', 'as', 'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at',
'the', 'University', 'of', 'Berlin', ',', 'Einstein', 'suggested', 'the', 'notion', 'of', 'a', '``', 'cosmological', 'constant', "''", '.',
'He']
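The next list keeps only purely alphabetic tokens, so punctuation and the number '1917' disappear; a minimal sketch of that filter:

# keep only tokens made entirely of letters
words = [word for word in tokens if word.isalpha()]
print(words[:100])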
['Albert', 'Einstein', 'is', 'widely', 'celebrated', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientists', 'who', 's', 'ever', 'lived',
'His', 'fame', 'was', 'due', 'to', 'his', 'original', 'and', 'creative', 'theories', 'that', 'at', 'first', 'seemed', 'crazy', 'but', 'that',
'later', 'turned', 'out', 'to', 'represent', 'the', 'actual', 'physical', 'world', 'Nonetheless', 'when', 'he', 'applied', 'his', 'theory',
'of', 'general', 'relativity', 'to', 'the', 'universe', 'as', 'a', 'whole', 'in', 'a', 'paper', 'published', 'in', 'while', 'serving', 'as',
'Director', 'of', 'the', 'Kaiser', 'Wilhelm', 'Institute', 'for', 'Physics', 'and', 'professor', 'at', 'the', 'University', 'of', 'Berlin',
'Einstein', 'suggested', 'the', 'notion', 'of', 'a', 'cosmological', 'constant', 'He', 'discarded', 'this', 'notion', 'when', 'it', 'had',
'been', 'established', 'that', 'the', 'universe']
Remove stopwords
Stopwords are the words which do not add much meaning to a sentence. They can
safely be ignored without sacrificing the meaning of the sentence. The most common
are short function words such as the, is, at, which, and on, etc.
In some cases, removing stopwords can cause problems, for example when searching for phrases that
include them, particularly in names such as “The Who” or “Take That”.
Removing the word "not" (which is on the stopword list) can also change the entire meaning (try
"this code is not good").
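The English stopword list printed below comes from NLTK's stopwords corpus; a minimal sketch of loading and printing it:

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)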
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was',
'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan',
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
As you can see, the stopwords are all lower case and don't have punctuation. If
we're going to compare them with our tokens, we need to make sure that our text is prepared
the same way (our text must have the same format as the stopwords, so we have to clean
the text before we can compare it).
This cell recaps everything we have previously learnt in this colab: tokenizing, lower
casing and keeping only alphabetic words.
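A minimal sketch of such a recap cell (tokenize, lowercase, keep alphabetic tokens, drop stopwords), which would produce the list shown below:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)
tokens = [w.lower() for w in tokens]                 # lowercase
tokens = [w for w in tokens if w.isalpha()]          # keep alphabetic words only
tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
print(tokens[:100])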
['albert', 'einstein', 'widely', 'celebrated', 'one', 'brilliant', 'scientists', 'ever', 'lived', 'fame', 'due', 'original', 'creative',
'theories', 'first', 'seemed', 'crazy', 'later', 'turned', 'represent', 'actual', 'physical', 'world', 'nonetheless', 'applied',
'theory', 'general', 'relativity', 'universe', 'whole', 'paper', 'published', 'serving', 'director', 'kaiser', 'wilhelm', 'institute',
'physics', 'professor', 'university', 'berlin', 'einstein', 'suggested', 'notion', 'cosmological', 'constant', 'discarded',
'notion', 'established', 'universe', 'indeed', 'expanding', 'contributions', 'physics', 'made', 'possible', 'envision',
'universe', 'order', 'understand', 'einstein', 'contribution', 'cosmology', 'helpful', 'begin', 'theory', 'gravity', 'rather',
'thinking', 'gravity', 'attractive', 'force', 'two', 'objects', 'tradition', 'isaac', 'newton', 'einstein', 'conception', 'gravity',
'property', 'massive', 'objects', 'bends', 'space', 'time', 'around', 'example', 'consider', 'question', 'moon', 'fly', 'space',
'rather', 'staying', 'orbit', 'around', 'earth', 'newton', 'would']
Stem words
Stemming refers to the process of reducing each word to its root or base form.
Two common suffix-stripping stemmers are Porter and Lancaster; each has its own algorithm
and they sometimes produce different outputs.
# stemming of words with the Porter stemmer
from nltk.stem import PorterStemmer

porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])
['albert', 'einstein', 'is', 'wide', 'celebr', 'as', 'one', 'of', 'the', 'most', 'brilliant', 'scientist', 'who', '’', 's', 'ever', 'live', '.', 'hi',
'fame', 'wa', 'due', 'to', 'hi', 'origin', 'and', 'creativ', 'theori', 'that', 'at', 'first', 'seem', 'crazi', ',', 'but', 'that', 'later', 'turn',
'out', 'to', 'repres', 'the', 'actual', 'physic', 'world', '.', 'nonetheless', ',', 'when', 'he', 'appli', 'hi', 'theori', 'of', 'gener', 'rel',
'to', 'the', 'univers', 'as', 'a', 'whole', 'in', 'a', 'paper', 'publish', 'in', '1917', ',', 'while', 'serv', 'as', 'director', 'of', 'the',
'kaiser', 'wilhelm', 'institut', 'for', 'physic', 'and', 'professor', 'at', 'the', 'univers', 'of', 'berlin', ',', 'einstein', 'suggest', 'the',
'notion', 'of', 'a', '``', 'cosmolog', 'constant', "''", '.', 'he']
# stemming of words with the Lancaster stemmer
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
stemmed = [lancaster.stem(word) for word in tokens]
print(stemmed[:100])
['albert', 'einstein', 'is', 'wid', 'celebr', 'as', 'on', 'of', 'the', 'most', 'bril', 'sci', 'who', '’', 's', 'ev', 'liv', '.', 'his', 'fam', 'was',
'due', 'to', 'his', 'origin', 'and', 'cre', 'the', 'that', 'at', 'first', 'seem', 'crazy', ',', 'but', 'that', 'lat', 'turn', 'out', 'to', 'repres',
'the', 'act', 'phys', 'world', '.', 'nonetheless', ',', 'when', 'he', 'apply', 'his', 'the', 'of', 'gen', 'rel', 'to', 'the', 'univers', 'as', 'a',
'whol', 'in', 'a', 'pap', 'publ', 'in', '1917', ',', 'whil', 'serv', 'as', 'direct', 'of', 'the', 'kais', 'wilhelm', 'institut', 'for', 'phys',
'and', 'profess', 'at', 'the', 'univers', 'of', 'berlin', ',', 'einstein', 'suggest', 'the', 'not', 'of', 'a', '``', 'cosmolog', 'const', "''", '.',
'he']
For example, suppose we have 3 sentences in our text corpus.
The first step in Bag-of-Words (BoW) is to collect each unique word and give it an ID or number;
this mapping is called the word_index.
Next, we create a vector of size 5, because our word_index is made of 5 words, and
place a 1 wherever the word exists in the sentence and a 0 where it doesn't.
Instead of placing a 1 whenever the word exists, there are other methods for scoring words: for
example the count (how many times the word occurred) or the frequency (how
frequently this word appears in the text corpus).
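A minimal sketch of this idea on a toy corpus (the three sentences are made up for illustration; they give a word_index of 5 words):

corpus = ["the dog barks", "the cat meows", "the dog meows"]

# step 1: collect each unique word and give it an ID (the word_index)
word_index = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index)
print(word_index)  # {'the': 0, 'dog': 1, 'barks': 2, 'cat': 3, 'meows': 4}

# step 2: one vector of size 5 per sentence, with 1 if the word is present and 0 otherwise
vectors = []
for sentence in corpus:
    vec = [0] * len(word_index)
    for word in sentence.split():
        vec[word_index[word]] = 1
    vectors.append(vec)
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1], [1, 1, 0, 0, 1]]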
Word Counts:
A simple way to tokenize text and build a vocabulary of known words (sometimes called the
word_index).
Term Frequency (TF): how often a given word appears within a document.
Inverse Document Frequency (IDF): how rare the word is across documents.
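A tiny worked example of these two quantities (toy documents; the smoothing used by scikit-learn is omitted for clarity):

import math

docs = [["the", "dog", "barks"],
        ["the", "cat", "meows"],
        ["the", "dog", "chases", "the", "cat"]]

def tf(term, doc):
    return doc.count(term) / len(doc)       # how often the term appears within one document

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return math.log(len(docs) / df)         # rarer across documents -> larger IDF

print(tf("dog", docs[2]), idf("dog", docs))  # 0.2 and log(3/2): 'dog' is in 2 of 3 documents
print(tf("the", docs[2]), idf("the", docs))  # 0.4 and log(3/3) = 0: 'the' is in every document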
Using scikit-learn for feature extraction: it tokenizes the text and builds a vocabulary of words
using the TF-IDF approach.
Limitations of BoW
Vocabulary: it requires building a vocabulary of all the words in the corpus, so
applying different text cleaning steps will result in different words and therefore a
different vocabulary. The vocabulary requires careful design, specifically to manage its size,
which impacts the sparsity of the document representations.
Sparsity: having a large vocabulary of words results in a large, sparse
vector representation that is mostly filled with zeros and a few ones. Sparse
representations are harder to model, both for computational reasons and for
information reasons: the challenge is for the models to harness so little
information in such a large representational space (harder to train the model, so also
harder to achieve good results).
The two sentences in the example are opposite in meaning, but their BoW representations result
in the same vector, since BoW doesn't care about the order of the words.
Meaning: discarding word order ignores the context of the sentence, and in turn loses
the meaning of the words in the document (semantics). Context and meaning can offer a lot
to the model; if this information were modeled, it could tell the difference between the
same words arranged differently.
This notebook shows the most basic applications of the different functions used
to create bag of words for texts
Word Counts
In this cell, we can see how the fit() and transform() functions enabled
us to create a vocabulary of 8 words for the document. The vectorizer's fit built
the vocabulary, and transform encoded the document as the number of appearances of
each word. The indexing is done alphabetically.
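The cell assumes a CountVectorizer that has already been created and fitted; a minimal sketch of that setup (the document is assumed to be the classic quick-brown-fox sentence, which is consistent with the 8-word vocabulary and the counts shown afterwards):

from sklearn.feature_extraction.text import CountVectorizer

# the text corpus (assumed for illustration)
text = ["The quick brown fox jumped over the lazy dog."]

# create the vectorizer and learn (fit) the vocabulary from the text
vectorizer = CountVectorizer()
vectorizer.fit(text)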
# summarize
print(vectorizer.vocabulary_)  # print the vocabulary of this vectorizer
# encode document
vector = vectorizer.transform(text)  # transform the text: convert the words into numbers based on how they appear in the sentence
# we get the vector, which is the encoded document of the text
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
# (1, 8): size of the vector, with 1 row of size 8 (because we have 8 unique words in our vocabulary)
# [[1 1 1 1 1 1 1 2]]: values of the vector. Take the first item '1', at index 0, in this case 'brown'.
# How many times do we have 'brown' in our text? We have the word 'brown' 1 time, so we put 1.
# This is how we convert our text into numbers: by counting the number of times each word appears in the text.

# in this example we are NOT going to fit or learn a new vocab, we are just going to transform this text using the already fitted vectorizer
text = ["The quick quick quick fox jumped over a big dog"]
# encode document
vector = vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())
# this is how we convert text into numbers using the CountVectorizer class: any word that doesn't exist in our vocab is simply not counted
(1, 8)
[[0 1 1 1 0 1 3 1]]
Word frequencies
This section deals with the application of the TF-IDF formula in order to describe word
appearances relative to multiple documents
Source: https://medium.com/nlpgurukool/tfidf-vectorizer-5421f1528402
This is another approach to vectorize or convert our text into numbers
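The cell below again assumes a fitted vectorizer, this time a TfidfVectorizer fitted on several documents; a minimal sketch of that setup (the three documents are assumed, chosen to be consistent with the comments about 'brown', 'the' and 'lazy' further down):

from sklearn.feature_extraction.text import TfidfVectorizer

# a small corpus of three documents (assumed for illustration)
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the vectorizer and learn the vocabulary and IDF values from all documents
vectorizer = TfidfVectorizer()
vectorizer.fit(text)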
# summarize
print(vectorizer.vocabulary_)  # print the vocab
print(vectorizer.idf_)  # print the IDF value learned for each word
# encode document
vector = vectorizer.transform([text[0]])  # transform the original text into its TF-IDF representation using the transform function
You can tell how, for example, the word 'brown' has a higher weight than 'the', even
though the latter appears twice in the first document. This is because 'the'
also appears in the other two documents, which reduces its uniqueness.
text = ["The quick quick quick fox jumped over a big dog"]
# encode document
vector = vectorizer.transform(text)  # vectorize this text using the fitted vectorizer
# The first item, 'brown', has a value of 0 because this text does not contain the word 'brown'
# The word 'quick', which has index 6, has the highest value here (0.84833724) because it is present more than once, so its term frequency is very high
We find ourselves with two zero values because this document does not include the
words 'brown' or 'lazy'. And since the word 'big' is not in the vocab, it is discarded.
sentiment_analysis.ipynb
Text classification
Text classification is one of the important tasks of text mining
--2022-09-16 10:20:36--
https://raw.githubusercontent.com/javaidnabi31/Word-Embeddding-Sentiment-
Classification/master/movie_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)...
185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com
(raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65862309 (63M) [text/plain]
Saving to: ‘movie_data.csv’
total 63M
drwxr-xr-x 1 root root 4.0K Sep 16 08:30 .
drwxr-xr-x 1 root root 4.0K Sep 16 08:27 ..
drwxr-xr-x 4 root root 4.0K Sep 14 13:43 .config
-rw-r--r-- 1 root root 63M Sep 16 10:20 movie_data.csv
drwxr-xr-x 1 root root 4.0K Sep 14 13:44 sample_data
# Dtype = int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review 50000 non-null object
1 sentiment 50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.4+ KB
A balanced dataset in sentiment analysis is a dataset which holds an equal amount of positive
sentiment data and negative sentiment data, meaning 50% of the data is positive and 50% is
negative
df.sentiment.value_counts()
1 25000
0 25000
Name: sentiment, dtype: int64
df.sentiment.value_counts().plot(kind='bar')
Text cleaning
print(df.review[10]) # print the 10th review (at index 10)
I loved this movie from beginning to end.I am a musician and i let drugs get in the way of my some of the things i
used to love(skateboarding,drawing) but my friends were always there for me.Music was like my rehab,life
support,and my drug.It changed my life.I can totally relate to this movie and i wish there was more i could say.This
movie left me speechless to be honest.I just saw it on the Ifc channel.I usually hate having satellite but this was a
perk of having satellite.The ifc channel shows some really great movies and without it I never would have found this
movie.Im not a big fan of the international films because i find that a lot of the don't do a very good job on
translating lines.I mean the obvious language barrier leaves you to just believe thats what they are saying but its not
that big of a deal i guess.I almost never got to see this AMAZING movie.Good thing i stayed up for it instead of
going to bed..well earlier than usual.lol.I hope you all enjoy the hell of this movie and Love this movie just as much
as i did.I wish i could type this all in caps but its again the rules i guess thats shouting but it would really show my
excitement for the film.I Give It Three Thumbs Way Up!<br /><br />This Movie Blew ME AWAY!
Let's define a function (clean_review) that would clean each movie review (sentence)
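The notes below only show excerpts of that cell; a minimal end-to-end sketch of what such a function could look like (assuming NLTK's English stopwords and a PorterStemmer, which is consistent with the stemmed, stopword-free output shown afterwards):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_review(text):
    text = text.lower()                                   # normalize case
    words = re.split(r'\W+', text)                        # split on non-word characters (drops punctuation)
    words = [w for w in words if w.isalpha()]             # keep alphabetic tokens only (drops numbers)
    stemmed = [stemmer.stem(w) for w in words]            # reduce each word to its stem
    words = [w for w in stemmed if w not in stop_words]   # remove stopwords
    return ' '.join(words)                                # recombine into a single string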
# Porter Stemmer
stemmed = [stemmer.stem(word) for word in words]  # loop over the array of words with a list comprehension, calling stemmer.stem on each word
# stemmed is an array that contains the stemmed version of the words array

# remove stopwords
# loop over each word in the text and check whether it is part of the stopwords: if it is NOT, add it to an array
# and finally recombine the array into the full text

# since we need the full text, NOT the array, we assign it back to text
return text
print(df['review'][1])
print(clean_review(df['review'][1]))
Actor turned director Bill Paxton follows up his promising debut, the
Gothic-horror "Frailty", with this family friendly sports drama about the
1913 U.S. Open where a young American caddy rises from his humble
background to play against his Bristish idol in what was dubbed as "The
Greatest Game Ever Played." I'm no fan of golf, and these scrappy underdog
sports flicks are a dime a dozen (most recently done to grand effect with
"Miracle" and "Cinderella Man"), but some how this film was enthralling
all the same.<br /><br />The film starts with some creative opening
credits (imagine a Disneyfied version of the animated opening credits of
HBO's "Carnivale" and "Rome"), but lumbers along slowly for its first by-
the-numbers hour. Once the action moves to the U.S. Open things pick up
very well. Paxton does a nice job and shows a knack for effective
directorial flourishes (I loved the rain-soaked montage of the action on
day two of the open) that propel the plot further or add some unexpected
psychological depth to the proceedings. There's some compelling character
development when the British Harry Vardon is haunted by images of the
aristocrats in black suits and top hats who destroyed his family cottage
as a child to make way for a golf course. He also does a good job of
visually depicting what goes on in the players' heads under pressure.
Golf, a painfully boring sport, is brought vividly alive here. Credit
should also be given the set designers and costume department for creating
an engaging period-piece atmosphere of London and Boston at the beginning
of the twentieth century.<br /><br />You know how this is going to end not
only because it's based on a true story but also because films in this
genre follow the same template over and over, but Paxton puts on a better
than average show and perhaps indicates more talent behind the camera than
he ever had in front of it. Despite the formulaic nature, this is a nice
and easy film to root for that deserves to find an audience.
actor turn director bill paxton follow hi promis debut gothic horror
frailti thi famili friendli sport drama u open young american caddi rise
hi humbl background play hi bristish idol wa dub greatest game ever play
fan golf scrappi underdog sport flick dime dozen recent done grand effect
miracl cinderella man thi film wa enthral br br film start creativ open
credit imagin disneyfi version anim open credit hbo carnival rome lumber
along slowli first number hour onc action move u open thing pick veri well
paxton doe nice job show knack effect directori flourish love rain soak
montag action day two open propel plot add unexpect psycholog depth
proceed compel charact develop british harri vardon haunt imag aristocrat
black suit top hat destroy hi famili cottag child make way golf cours also
doe good job visual depict goe player head pressur golf pain bore sport
brought vividli aliv credit also given set design costum depart creat
engag period piec atmospher london boston begin twentieth centuri br br
know thi go end onli becaus base true stori also becaus film thi genr
follow templat paxton put better averag show perhap indic talent behind
camera ever front despit formula natur thi nice easi film root deserv find
audienc
y = df['sentiment'].values
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
# (25000,) (25000,): we have 25000 rows for the training input. Each input is a single item because we have 1 string of words (1 review)
# and the second shape is 25000 rows, each a single integer, which is the sentiment label

# here we want to convert our reviews from being arrays of words to arrays of numbers
# FILL BLANKS
# define a CountVectorizer (with binary=True and max_features=10000)
vectorizer = CountVectorizer(binary=True, max_features=10000)
print(x_train_bow.shape, y_train.shape)
print(x_test_bow.shape, y_test.shape)
Classification
Our data is ready for classification. Let's use LogisticRegression
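A minimal sketch of this step, training on the bag-of-words features and checking accuracy on the test set (the classifier variable name clf is ours):

from sklearn.linear_model import LogisticRegression

# train a logistic regression classifier on the BoW features
clf = LogisticRegression()
clf.fit(x_train_bow, y_train)

# accuracy on the held-out test reviews
print(clf.score(x_test_bow, y_test))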
Fitting with the default settings prints part of scikit-learn's ConvergenceWarning:
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
# we have to make sure that our input review passes through the same steps,
# i.e. it is cleaned and preprocessed the same way, before it can be fed to the model
# review = the actual review to predict on
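A minimal sketch of predicting the sentiment of a new review, reusing clean_review, the fitted CountVectorizer and the classifier (called clf in the sketch above; the review text is an arbitrary example):

review = "This movie was surprisingly good, I really enjoyed it"  # actual review to predict on

cleaned = clean_review(review)              # same cleaning steps as the training data
features = vectorizer.transform([cleaned])  # same fitted CountVectorizer
print(clf.predict(features))                # 1 = positive, 0 = negative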
There is a better way to encode text for neural networks that addresses the limitations of the BoW approach: word embeddings.
A word embedding is a learned representation for text where words that have the same meaning have
a similar representation.
It is considered one of the breakthroughs of deep learning on NLP problems,
because it was one of the first approaches to take the meaning of words into
consideration.
- Part-of-Speech tagging
- Named-Entity recognition
Word2Vec
One of the earliest and most famous word-embedding models, which converts
words into vectors.
It is an unsupervised learning approach: the model is given a text corpus or a
series of documents and learns an embedding for the words in this corpus.
It is a statistical method for efficiently learning a standalone word embedding from a
text corpus.
Developed by Tomas Mikolov et al. at Google in 2013.
It has become the de facto standard for developing pre-trained word embeddings.
The cool property of word2vec is that similar words have similar vectors,
so we can draw inferences about different words based on vector similarity.
Uses small neural networks to calculate word embeddings based on words’ local
context (window of neighboring words). There are 2 different approaches for
word2vec models:
CBOW works by taking a window of 5 words, hiding the word in the middle, and then
teaching a model to predict this word based on the 4 neighboring words. The CBOW
model learns the embeddings by predicting the current word based on its context
The continuous skip-Gram model works by taking a word and learning how to predict
the four neighboring words based on that word. The continuous skip-Gram model learns
by predicting the surrounding words based on the current word.
appearances next to each other. This expands the ability of the model to
separate words from each other and their meanings.
Gensim
Open source Python library for natural language processing.
It is billed as “topic modeling for humans”.
It is not a generic NLP framework like NLTK, but rather a mature, focused, and
efficient suite of NLP tools for topic modeling.
Topic modeling involves extracting the topic or content of a certain text. It asks
the question: what are we talking about here?
Gensim includes an implementation of the Word2Vec model, so it is useful when we want to
train a custom word embedding model or load a pre-trained model for
inference.
Word Embeddings
Objective: the goal of this exercise is to explore the Word2Vec technique for word
embeddings and to introduce Stanford's GloVe embeddings as well. The libraries we will
be using are Gensim for the Word2Vec and GloVe word embeddings, matplotlib for
visualization and scikit-learn for Principal Component Analysis (PCA), which
is used for reducing dimensionality.
size: (default 100) The number of dimensions of the embedding, e.g. the length of the
dense vector to represent each token (word).
window: (default 5) The maximum distance between a target word and words around the
target word.
min_count: (default 5) The minimum count of words to consider when training the
model; words with an occurrence less than this count will be ignored.
workers: (default 3) The number of threads to use while training.
sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).
from gensim.models import Word2Vec

# with min_count=1 we keep every word that appears at least one time
# train model
model = Word2Vec(sentences, min_count=1)
# summarize vocabulary
words = list(model.wv.vocab)  # model.wv.vocab is the vocabulary of the model; we convert it to a list
print(words)
# save model
model.save('model.bin')
Results:
# load model
new_model = Word2Vec.load('model.bin')
print(new_model['this', 'is'])
[[-2.6341430e-03 -6.8210840e-04 -2.2032256e-03 -2.9977076e-04
4.1053989e-03 -1.6542217e-03 -4.9963145e-04 1.7543014e-03
-1.2514311e-04 -4.7425432e-03 1.7437695e-03 2.6504786e-03
-9.4006537e-04 -1.4253150e-03 3.6435311e-03 -2.3904638e-03
-4.4562640e-03 -4.9993531e-03 -2.2515173e-03 2.3276228e-03
-3.3021998e-03 4.6024695e-03 3.0749412e-03 -4.2429226e-03
-1.5931168e-03 3.8181676e-03 2.4235758e-03 -3.4062283e-03
3.3555010e-03 1.2844945e-03 -2.7234193e-03 -8.0988108e-04
1.6813108e-03 -9.1353070e-04 4.4700159e-03 1.0303746e-03
2.9608165e-03 4.2942618e-03 -3.7393915e-03 -3.5502152e-03
-4.6064076e-03 -4.6758484e-03 4.1863527e-03 3.1615091e-03
4.5192647e-03 -9.1852510e-04 2.6027197e-03 -4.1375658e-03
-3.5489022e-03 -2.2965863e-03 7.5722620e-04 6.1148533e-04
2.1013829e-03 -3.9978596e-04 9.7986590e-04 4.9094297e-03
-2.1577110e-03 4.8989342e-03 3.8428246e-03 2.7198202e-03
-1.4638512e-03 -1.8885430e-03 -2.5638286e-04 -2.0531581e-03
-1.6875562e-03 -2.7395764e-03 5.5740983e-04 2.5572355e-03
5.6512235e-04 -2.9321280e-03 -7.5157406e-04 -1.2344553e-04
1.1710543e-03 -1.1974664e-03 -1.2816156e-03 -4.1503790e-03
4.5915483e-03 -9.5228269e-04 -3.4181602e-04 -2.3537560e-03
-2.6966461e-03 7.6452579e-04 -1.2867393e-03 1.2928311e-03
-3.9543938e-03 -4.6212925e-03 4.1194628e-03 4.6026148e-03
4.3687961e-04 -2.1079446e-03 4.3081054e-03 4.9309372e-03
-2.4515828e-03 -1.7825521e-04 2.4474061e-03 -2.4826333e-03
-4.6479427e-03 4.2938837e-03 -1.1659318e-04 -1.8699563e-03]
[-5.1402091e-04 4.0105209e-03 -6.2171964e-04 2.5609015e-03
-1.8050724e-03 4.9377461e-03 3.2770697e-03 1.8770240e-03
-4.9929242e-03 3.2758827e-03 -1.6969994e-03 -2.6876389e-03
2.9338847e-03 -4.3162694e-03 -4.2533325e-03 4.6241065e-03
-1.0224945e-03 2.1736498e-03 4.4645616e-03 -1.1554005e-03
-3.8527905e-03 -6.9526068e-05 2.0650621e-04 4.6414724e-03
1.1177687e-03 7.0009363e-04 3.9822836e-03 7.9727673e-05
-2.5550727e-04 1.5229684e-03 -3.3755524e-03 4.8815636e-03
# train model
model = Word2Vec(sentences, min_count=1)
Results:
app.launch_new_instance()
Google Word2Vec
Instead of training your own word vectors (which requires a lot of RAM and compute
power), you can simply use a pre-trained word embedding. Google has published a pre-
trained Word2Vec model that was trained on Google news data (about 100 billion
words). It contains 3 million words and phrases and was fit using 300-dimensional
word vectors. It is a 1.53 Gigabyte file.
import gensim.downloader as api  # import assumed; not shown in the notes

model = api.load("word2vec-google-news-300")
[==================================================] 100.0%
1662.8/1662.8MB downloaded

# Alternatively, load a local copy of the pre-trained vectors:
# filename = 'GoogleNews-vectors-negative300.bin'
# model = KeyedVectors.load_word2vec_format(filename, binary=True)
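The two results below look like word-analogy queries made with most_similar; a hedged sketch of such queries (the exact words used in the notebook are an assumption):

# king - man + woman ≈ queen
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))

# an analogous capital-city query, e.g. paris - france + spain ≈ madrid
print(model.most_similar(positive=['paris', 'spain'], negative=['france'], topn=1))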
[('queen', 0.7118192911148071)]
[('madrid', 0.5295541286468506)]
/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py:895:
FutureWarning: arrays to stack must be passed as a "sequence" type such as
list or tuple. Support for non-sequence iterables such as generators is
deprecated as of NumPy 1.16 and will raise an error in the future.
# download
!wget http://nlp.stanford.edu/data/glove.6B.zip
Archive: glove.6B.zip
inflating: glove.6B.50d.txt
inflating: glove.6B.100d.txt
inflating: glove.6B.200d.txt
inflating: glove.6B.300d.txt
total 2.9G
drwxr-xr-x 1 root root 4.0K Sep 16 13:45 .
drwxr-xr-x 1 root root 4.0K Sep 16 12:50 ..
drwxr-xr-x 4 root root 4.0K Sep 14 13:43 .config
-rw-rw-r-- 1 root root 332M Aug 4 2014 glove.6B.100d.txt
-rw-rw-r-- 1 root root 662M Aug 4 2014 glove.6B.200d.txt
-rw-rw-r-- 1 root root 990M Aug 27 2014 glove.6B.300d.txt
-rw-rw-r-- 1 root root 164M Aug 4 2014 glove.6B.50d.txt
-rw-r--r-- 1 root root 823M Oct 25 2015 glove.6B.zip
-rw-r--r-- 1 root root 22K Sep 16 13:03 model.bin
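The result below comes from loading one of the GloVe files into gensim and running the same kind of analogy query; a hedged sketch (which of the four files was used is an assumption; the 100-dimensional one is shown):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# convert the GloVe text format to the word2vec text format, then load it
glove2word2vec('glove.6B.100d.txt', 'glove.6B.100d.word2vec.txt')
glove_model = KeyedVectors.load_word2vec_format('glove.6B.100d.word2vec.txt', binary=False)

# king - man + woman ≈ queen
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))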
[('queen', 0.7698541283607483)]