NLP Essentials
Contents

Lecture 1 – Sentiment Analysis as your very first NLP application
3.5 Word2vec
    3.5.1 A simple neural network with two hidden layers
    3.5.2 Continuous Bag of Words: idea and implementation
Lecture 4 – Part-Of-Speech and Named-Entity-Recognition
4.1 Part-Of-Speech and Named-Entity-Recognition
    4.1.1 Markov Chain
    4.1.2 Hidden Markov Model for POS application
    4.1.3 Viterbi algorithm
Appendices – Python cheatsheets
5.1 cheatsheet
1.1 Introduction
Let 𝒮 be the set of all English sentences (the string space). Let 𝑔 : 𝒮 → {0, 1} be the sentiment function which assigns the value 0 (negative sentiment) or 1 (positive sentiment) to a given sentence 𝑠.
Our ultimate goal is to establish a formula for 𝑔(𝑠) for an arbitrary sentence 𝑠. The first question to ask is: where should we start?
A first idea that comes to mind is that, instead of figuring out a "singular" function taking only the discrete values 0 and 1, we try to determine a "smoother" function 𝑔¯ : 𝒮 → [0, 1] which computes the probability of a sentence being sentimentally positive. We can then apply a rule such as: if 𝑔¯(𝑠) surpasses the threshold 0.5, the sentiment is positive, i.e. 𝑔(𝑠) = 1.
First linguistic intuition and first assumption: Each word of natural language
has its own sentiment and the sentiment of a sentence can be determined by that
of all words in this sentence.
    g(s) = \sum_{w \in s} g(w).    (1.1)
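To make assumption (1.1) concrete, here is a minimal sketch in Python; the tiny word_sentiment lexicon is a made-up illustration, not part of the model developed below.

word_sentiment = {"great": 1.0, "love": 1.0, "boring": 0.0, "terrible": 0.0}

def sentence_score(sentence):
    # sum the per-word sentiments of the words we know; unknown words contribute 0
    return sum(word_sentiment.get(w, 0.0) for w in sentence.lower().split())

print(sentence_score("I love this great movie"))  # 2.0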
1.3 Implementation with Logistic Regression Model
import numpy as np
from auxiliary_functions import sigmoid, process_tweet, gradient_descent_logistic
import nltk
from nltk.corpus import twitter_samples
import pickle
import streamlit as st
# Reconstructed sketch (name assumed): map each (word, label) pair to its count; labels are 1.0 / 0.0.
def build_freqs(tweets, ys):
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            freqs[(word, y)] = freqs.get((word, y), 0) + 1
    return freqs
def extract_features(tweet, freqs):
    # reconstructed sketch: features are [bias, positive-frequency sum, negative-frequency sum],
    # matching the 3-column matrix X built below
    word_l = process_tweet(tweet)
    x = np.zeros(3)
    x[0] = 1.0
    for word in word_l:
        x[1] += freqs.get((word, 1.0), 0)
        x[2] += freqs.get((word, 0.0), 0)
    return x
def tweet_sentiment_analysis_runner():
    # get the train set and test set from the NLTK twitter samples
    # nltk.download('twitter_samples')
    # nltk.download('stopwords')
    all_positive_tweets = twitter_samples.strings('positive_tweets.json')
    all_negative_tweets = twitter_samples.strings('negative_tweets.json')

    # split data into train and test sets
    test_pos = all_positive_tweets[4000:]
    train_pos = all_positive_tweets[:4000]
    test_neg = all_negative_tweets[4000:]
    train_neg = all_negative_tweets[:4000]

    train_x = train_pos + train_neg
    test_x = test_pos + test_neg
    # collect the features 'x' and stack them into a matrix 'X'
    X = np.zeros((len(train_x), 3))
    for i in range(len(train_x)):
        X[i, :] = extract_features(train_x[i], freqs)
# Reconstructed sketch of the prediction helper (name and signature assumed):
# score a single tweet with the learned logistic-regression weights.
def predict_tweet(tweet, freqs, weights):
    x = extract_features(tweet, freqs)
    y_pred = sigmoid(np.dot(x, weights))
    return y_pred
def streamlit_sentiment_analysis():
    st.title("Sentiment Analysis for Tweets")
    tweet = st.text_input("Enter a tweet:")
    rerun_option = st.selectbox("Do you want to rerun the model", ["Yes", "No"])
    if rerun_option == "Yes":
        loss, weights, freqs = tweet_sentiment_analysis_runner()
    else:
        with open('logistic_regression_model.pkl', 'rb') as file:
            logistic_params = pickle.load(file)
        weights = logistic_params['weights']
        loss = logistic_params['loss_history']
        freqs = logistic_params['frequencies']
if __name__ == "__main__":
    # entry point (assumed): build the Streamlit app
    streamlit_sentiment_analysis()
Lecture 2 – N-gram Language Model
2.1 Introduction
Let's start this chapter by talking a bit about how we started to learn our mother tongue when we were children.
A child's journey to learn to speak and acquire their mother tongue is a fascinating and complex process. After passing through the pre-linguistic phase, with a lot of babbling sounds during the very first year, a child often begins to say their first single words like "mum", "dad", etc. Around 18-24 months, children start combining two words to form simple sentences ("more juice" for instance). We call it the two-word stage. Then they go into the Telegraphic Speech phase and start to form simple sentences, usually with 3-4 words. And, fascinatingly enough, they start to understand and use a bit of their mother tongue's grammar.
That's how a child naturally understands and uses a language (to communicate with people), and that's the nature of language itself. It always has some sort of ordering structure, due to standard grammar rules, the community style (American English has a different style from British English, for example), etc. Hence, naturally, certain sequences of words appear more often than others.
2.2 Mathematical formulation for n-gram models
• An n-gram model can be seen as the most basic language model, one that can generate text in the form of multiple n-gram sequences.
From (2.1), the computation comes down to determining, for each 𝑖, the occurrence probability of a single word 𝑤_𝑖 given the whole history, i.e. the 𝑖 − 1 previous words. Coming up with an explicit formula for this conditional probability seems to be an impossible mission. This is where we need some extra assumptions (hypotheses) to simplify the objects we need to compute.
Markov assumption: the occurrence of 𝑤_𝑖 depends only on the 𝑁 − 1 preceding words, i.e. P(𝑤_𝑖 | 𝑤_{1:𝑖−1}) = P(𝑤_𝑖 | 𝑤_{𝑖−𝑁+1:𝑖−1}). For the unigram model, i.e. 𝑁 = 1, P(𝑤_𝑖 | 𝑤_{1:𝑖−1}) = P(𝑤_𝑖), the occurrence probability of the word 𝑤_𝑖 itself. In the unigram world, all words are independent of each other. For the bigram model (𝑁 = 2), P(𝑤_𝑖 | 𝑤_{1:𝑖−1}) = P(𝑤_𝑖 | 𝑤_{𝑖−1}), i.e. the occurrence of the word 𝑤_𝑖 only depends on the preceding word 𝑤_{𝑖−1}.
Let's consider the bigram model. A natural candidate for the P(𝑤_𝑖 | 𝑤_{1:𝑖−1}) formula is

    P(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i)}{\sum_{w \in V} C(w_{i-1} w)} = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}    (2.2)

where 𝐶(𝑥) is the counting function of the word sequence 𝑥 with respect to some training corpus 𝑉, i.e. 𝐶(𝑤_{𝑖−1} 𝑤_𝑖) is the number of occurrences of the sequence (bigram) 𝑤_{𝑖−1} 𝑤_𝑖 appearing in 𝑉 and 𝐶(𝑤_𝑖) is the number of occurrences of the unigram 𝑤_𝑖 appearing in 𝑉.
Hence, in the bigram case,

    P(w_{1:n}) = \prod_{i=1}^{n} P(w_i | w_{i-1}) = \prod_{i=1}^{n} \frac{C(w_{i-1} w_i)}{\sum_{w \in V} C(w_{i-1} w)} = \prod_{i=1}^{n} \frac{C(w_{i-1} w_i)}{C(w_{i-1})}.    (2.3)
However, let's test this on a toy training corpus over the vocabulary {𝑎, 𝑏} consisting of the four sentences:

    <s> a b
    <s> a a
    <s> b a
    <s> b b
On the one hand, it is easy to see that the unigram count 𝐶(𝑎) is equal to 4. On the other hand, Σ_{𝑤∈𝑉} 𝐶(𝑎𝑤) is actually equal to 2. Hence, (2.3) does not seem correct: the denominator Σ_{𝑤∈𝑉} 𝐶(𝑎𝑤) need not equal the unigram count 𝐶(𝑎).
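These counts can be checked directly by enumerating unigrams and bigrams over the toy corpus (a minimal sketch, following the four sentences above with no end-of-sentence symbol):

from collections import Counter

# the four training sentences over {a, b}, each preceded by the start symbol <s>
corpus = [["<s>", "a", "b"], ["<s>", "a", "a"], ["<s>", "b", "a"], ["<s>", "b", "b"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

print(unigrams["a"])                                          # C(a) = 4
print(sum(c for (w1, _), c in bigrams.items() if w1 == "a"))  # sum over w of C(a w) = 2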
Moreover, we will show that the probabilities of all 2-word sentences over {𝑎, 𝑏} sum to 1. Indeed,
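using the bigram estimates of (2.2) with the Σ_{𝑤∈𝑉} 𝐶(· 𝑤) normalization on the toy corpus above, a sketch of the computation reads:

    \sum_{w_1, w_2 \in \{a, b\}} P(w_1 w_2)
        = \sum_{w_1} P(w_1 \mid \text{<s>}) \sum_{w_2} P(w_2 \mid w_1)
        = \sum_{w_1} P(w_1 \mid \text{<s>}) \cdot 1
        = 1,

since each estimated conditional distribution sums to 1 by construction. Concretely, every one of the four sentences gets probability (1/2)(1/2) = 1/4.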
2.3 Zero probability problem and smoothing techniques

Taking a closer look at our prediction function for P(𝑤_1 𝑤_2 . . . 𝑤_𝑛), there are two serious problems that we should deal with:
• Out-of-vocabulary (OOV) words. If we encounter a word in the test set which does not appear at all in the training corpus, then the unigram count for this word becomes 0 and (2.1) becomes undefined.
• Since our prediction function is a product of many terms, we do not want a zero term that just kills the whole thing. This situation occurs when a new association 𝑤_{𝑖−1} 𝑤_𝑖 appears in the test set. Since this bigram does not
appear in the training corpus, this term becomes 0 and so does the whole prediction.
In the prediction phase, given a sentence, we can mark any new word (one that does not appear in the training corpus) by converting it into a special symbol <UNK> (standing for UNKNOWN). For example, consider the sentence "TRINH TUAN PHONG is teaching the NLP specialization in 2024". It's very likely that the model has absolutely no idea who I am, so the sentence will be converted into "<UNK> <UNK> <UNK> is teaching the NLP specialization in 2024".
Now, a natural question arises: what is the occurrence probability of the word <UNK>? Since <UNK> is just a fictitious symbol that we made up, we have to incorporate <UNK> into the training corpus in some way and compute its occurrence probability. There are two common ways to do so.
• If we have a fixed vocabulary in hand, we simply convert every word outside this vocabulary into <UNK>, both in the training corpus and at prediction time.
• In case we do not have a fixed vocabulary in hand, another way to assign a probability to <UNK> is to fix a minimum frequency 𝑚: all words that occur less frequently than 𝑚 times in the training corpus are converted into <UNK>, as in the sketch below.
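A minimal sketch of the min-frequency strategy (the threshold 𝑚 = 2 and the toy corpus are illustrative):

from collections import Counter

def replace_rare_words(sentences, m=2):
    # words seen fewer than m times in the corpus are mapped to <UNK>
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= m else "<UNK>" for w in sent] for sent in sentences]

corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
print(replace_rare_words(corpus, m=2))
# [['i', 'like', 'tea'], ['i', 'like', '<UNK>'], ['<UNK>', 'like', 'tea']]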
2.3.2 Smoothing
To handle the zero factor problem, we make use of various smoothing techniques. Probably the most familiar one is the Laplace (add-one) smoothing technique.
Assume that the training corpus has 𝑁 tokens in total, with vocabulary 𝑉 (the set of unique words in the training corpus). Then, in the unigram model, the probability of the word 𝑤_𝑖 is
    P(w_i) = \frac{C(w_i)}{N}.
Now we apply Laplace smoothing, which means that we simply increment the count of each word in the vocabulary by 1. Hence,

    P_{Laplace}(w_i) = \frac{C(w_i) + 1}{N + |V|}.    (2.5)
Note that (2.5) assigns a probability of 1/(𝑁 + |𝑉|) to any new word from a sentence in the unigram model.
Similarly, in the bigram model,

    P_{Laplace}(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{\sum_{w \in V} (C(w_{i-1} w) + 1)} = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}.    (2.6)
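As an illustration, here is a minimal sketch of the add-one estimate (2.6) on a toy token list (the corpus and the helper name are made up for the example):

from collections import Counter

def laplace_bigram_prob(w_prev, w, unigram_counts, bigram_counts, vocab_size):
    # add-one smoothed estimate of P(w | w_prev), as in (2.6)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = ["<s>", "i", "like", "tea", "<s>", "i", "like", "coffee"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))   # 5 unique tokens

print(laplace_bigram_prob("like", "tea", unigram_counts, bigram_counts, V))    # (1 + 1) / (2 + 5)
print(laplace_bigram_prob("like", "juice", unigram_counts, bigram_counts, V))  # unseen bigram: (0 + 1) / (2 + 5)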
Lecture 3 – Semantic meaning and word embedding
3.1 Introduction
This observation leads to the idea of analyzing how words co-occur, or how they are distributed in our language, in order to capture the meaning of words.
Another great idea, from linguists such as Osgood et al. in 1957, is determining a word's connotation via certain factors (c.f.).
… in different contexts.
• Consider again the document searching problem. Popular words might appear millions of times in each document, but they are not helpful at all for us to
distinguish a document from the rest of the collection (just like stopwords).
That is, we use the log-scaled term frequency tf(w_i, d) = \log_{10}(\mathrm{count}(w_i, d) + 1). Here we take the log as a good routine to avoid numerical instabilities (overflow, precision loss), and we add 1 to handle the case where the count is equal to 0.
Next, let 𝑁 be the number of docs in the collection. Then

    \frac{df(w_i)}{N} = \frac{|\{\text{documents that contain the word } w_i\}|}{N}    (3.2)

is the document frequency w.r.t. the word 𝑤_𝑖 from the collection. Then, we can define the Inverse Document Frequency (idf) as follows:

    idf(w_i) = idf(w_i, \text{collection\_size}) = \log_{10} \frac{N}{df(w_i)}.    (3.3)
The quantity 𝑖𝑑𝑓(𝑤_𝑖) varies from 0 to +∞: it takes values close to 0 for undiscriminative (unimportant) words, while discriminative (important) words get large values.
Hence, the tf-idf weight of a word 𝑤_𝑖 in a document 𝑑 combines the two quantities:

    \text{tf-idf}(w_i, d) = tf(w_i, d) \times idf(w_i).

Our readers are encouraged to replace the raw frequency by the above formulation and redo the computation in Section 3.2.
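A minimal sketch of the tf-idf computation on a toy collection (the three short documents are made up; the term frequency log10(count + 1) and the idf of (3.3) follow the formulas above):

import math

docs = [["cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ate"]]
N = len(docs)

def tf_idf(word, doc):
    tf = math.log10(doc.count(word) + 1)           # log-scaled term frequency
    df = sum(1 for d in docs if word in d)         # number of documents containing the word
    idf = math.log10(N / df) if df > 0 else 0.0    # inverse document frequency, eq. (3.3)
    return tf * idf

print(round(tf_idf("the", docs[0]), 3))  # 0.0: "the" appears in every document
print(round(tf_idf("cat", docs[0]), 3))  # > 0: "cat" is more discriminative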
3.4 Pointwise Mutual Information
Then, to measure how often two words occur together in the same environment
(neighborhood), we can consider the following so-called Mutual Information
(MI):
    MI(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)}    (3.5)
which compares the joint occurrence with the one we would expect if the two words were independent. In other words, Mutual Information measures how much one random variable tells us about another, i.e. how much the appearance of one word predicts that of, let's say, the next word.
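As a small illustration, the sketch below turns a made-up word-by-word co-occurrence count matrix into PPMI scores, i.e. max(0, 𝑀𝐼(𝑥, 𝑦)) with the probabilities estimated from the counts; the words and counts are purely illustrative.

import numpy as np

words = ["data", "computer", "pie", "sugar"]
# made-up co-occurrence counts C(x, y) within some context window
C = np.array([[0.0, 8, 1, 0],
              [8, 0.0, 1, 0],
              [1, 1, 0.0, 6],
              [0, 0, 6, 0.0]])

P = C / C.sum()                       # joint probabilities p(x, y)
px = P.sum(axis=1, keepdims=True)     # marginals p(x)
py = P.sum(axis=0, keepdims=True)     # marginals p(y)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(P / (px * py))
ppmi = np.maximum(pmi, 0)             # positive PMI: clip negative / undefined values to 0
ppmi[np.isnan(ppmi)] = 0.0

print(np.round(ppmi, 2))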
Mutual Information is a very interesting concept from Information Theory that deserves a little more explanation: there is a nice relation between the Kullback-Leibler divergence, entropy and Mutual Information.
3.4.1 Entropy
Definition 3.1. Let 𝑋 be a discrete random variable that takes values in 𝒳 and let 𝑝(𝑥) := P(𝑋 = 𝑥). The entropy 𝐻(𝑋) measures the degree of uncertainty or randomness carried by 𝑋.
Precisely,
• Let 𝐼(𝑋) := − log_𝑏(𝑝(𝑋)) = log_𝑏(1/𝑝(𝑋)) be the information content of 𝑋 for a certain base 𝑏. For a rare event 𝑋 = 𝑥 with small 𝑝(𝑥), 𝐼(𝑋 = 𝑥) is a huge value. For a very common event which happens all the time, i.e. 𝑝(𝑥) is close to 1, 𝐼(𝑋 = 𝑥) tends to 0. That means a common event contributes less to the degree of uncertainty from the Information Theory perspective.
• There are three common values of the base 𝑏 that we use: 𝑏 = 2 gives the
units of bits (shannons); base 10 gives units of dits; base 𝑒 gives natural units
nat. Throughout the chapter, we omit the subscript 𝑏 whenever we use the
base 2.
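With this notation, the entropy of Definition 3.1 can be written explicitly as the expected information content (a standard formula, recalled here for completeness):

    H(X) = \mathbb{E}[I(X)] = -\sum_{x \in \mathcal{X}} p(x) \log_b p(x).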
Definition 3.2. For a pair of random variables (𝑋, 𝑌), we can naturally define a so-called joint entropy 𝐻(𝑋, 𝑌) as follows:

    H(X, Y) = -\sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y) \log_b p(x, y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_b p(x, y)
It is not too hard to see that the joint entropy can not surpass the sum of two
marginal (individual) entropies i.e. 𝐻(𝑋 , 𝑌) ≤ 𝐻(𝑋) + 𝐻(𝑌).
Example. Assume that you want to send some encoded message using a sequence of the 4 characters 'A', 'B', 'C', and 'D' over a binary channel. How many bits do you need to transmit the message?
• If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter (the entropy is equal to 2 bits). 'A' might be coded as '00', 'B' as '01', 'C' as '10', and 'D' as '11'.
• However, if the probabilities of each letter are unequal, say ’A’ occurs with
70% probability, ’B’ with 26%, and ’C’ and ’D’ with 2% each, one could assign
variable length codes. In this case, ’A’ would be coded as ’0’, ’B’ as ’10’, ’C’ as
’110’, and ’D’ as ’111’. With this representation, 70% of the time only one bit
needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On
average, fewer than 2 bits are required since the entropy is lower.
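A quick numerical check of this example (using the probabilities and code lengths given above):

import math

probs = {"A": 0.70, "B": 0.26, "C": 0.02, "D": 0.02}
code_lengths = {"A": 1, "B": 2, "C": 3, "D": 3}   # codes '0', '10', '110', '111'

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_bits = sum(probs[c] * code_lengths[c] for c in probs)

print(round(entropy, 3))   # about 1.09 bits per letter
print(round(avg_bits, 2))  # 1.34 bits per letter on average, fewer than 2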
Example. English text, treated as a string of characters, has fairly low entropy;
i.e. it is fairly predictable. After the first few letters one can often guess the rest
of the word. English text has between 0.6 and 1.3 bits of entropy per character
of the message.
Definition 3.3. Given two discrete random variables 𝑋 and 𝑌 with support sets 𝒳 and 𝒴, the conditional entropy 𝐻(𝑋|𝑌) is the quantity that measures the remaining randomness (uncertainty) in 𝑋 given information from 𝑌. In other words, 𝐻(𝑋|𝑌) measures how much uncertainty about 𝑋 remains once 𝑌 is known.

    H(X \mid Y) := -\sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y) \log_b \frac{p(x, y)}{p(y)}    (3.7)
2. 𝐻(𝑋|𝑌) ≤ 𝐻(𝑋) for any 𝑋 , 𝑌. The equality holds iff 𝑋 and 𝑌 are two
independent variables.
    MI(X, Y) := \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} p(x, y) \log_b \frac{p(x, y)}{p(x)\, p(y)}    (3.8)
where 𝑝(𝑥, 𝑦) = P(𝑋 = 𝑥, 𝑌 = 𝑦), 𝑝(𝑥) = P(𝑋 = 𝑥) and 𝑝(𝑦) = P(𝑌 = 𝑦).
The second property of Mutual Information gives a nice interpretation of the concept: 𝑀𝐼(𝑋, 𝑌) can be seen as the reduction of the uncertainty of 𝑋 obtained by conditioning on 𝑌. Hence, Mutual Information measures how much knowing one random variable reduces the uncertainty of the other.
Moreover, it also implies that 𝑀𝐼(𝑋, 𝑌) coincides with Information Gain, a familiar concept that we use in Decision Trees, where 𝑋 plays the role of the label and 𝑌 is a specific feature (we choose the feature which maximizes the Information Gain to split on).
In terms of comparing a joint probability with the product of probabilities, the concept of Mutual Information can be linked naturally to the Kullback-Leibler divergence (KL statistical distance):

    MI(X, Y) = D_{KL}\big( P_{(X, Y)} \,\|\, P_X \otimes P_Y \big),

where P_{(𝑋,𝑌)} is the joint probability and P_𝑋 ⊗ P_𝑌 is the tensor product which assigns the probability 𝑝(𝑥) 𝑝(𝑦) to each pair (𝑥, 𝑦).
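As a small numerical illustration of these interpretations, the sketch below computes MI on a made-up 2×2 joint distribution both from definition (3.8) and as the reduction of uncertainty 𝐻(𝑋) − 𝐻(𝑋|𝑌); the table values are arbitrary.

import numpy as np

# made-up 2x2 joint distribution p(x, y)
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
px = P.sum(axis=1)                     # marginal p(x)
py = P.sum(axis=0)                     # marginal p(y)

# Mutual Information from definition (3.8)
mi = sum(P[i, j] * np.log2(P[i, j] / (px[i] * py[j]))
         for i in range(2) for j in range(2))

# the same quantity as a reduction of uncertainty: H(X) - H(X|Y)
H_X = -sum(p * np.log2(p) for p in px)
H_X_given_Y = -sum(P[i, j] * np.log2(P[i, j] / py[j])
                   for i in range(2) for j in range(2))

print(round(mi, 4), round(H_X - H_X_given_Y, 4))  # the two values coincide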
3.5 Word2vec
The (static) embeddings with tf-idf and PPMI that we have seen in Sections 3.3 and 3.4 have certain critical flaws:
• Embedding vectors generated by these methods are sparse and live in a huge-dimensional space (of the vocabulary size 𝑉). Hence, they are not really practical to use.
• Assume that we have such a huge, sparse vector embedding 𝑢 of vocabulary size 𝑉. Mathematically speaking, we want to reduce the size of 𝑢 by applying a linear transformation to 𝑢 to get 𝑣 = 𝐴𝑢, where 𝐴 ∈ ℳ(𝑁, 𝑉) and 𝑁 is much smaller than 𝑉. In Deep Learning language, this is equivalent to passing the vector through a neural network layer with some hidden size 𝑁.
• What learning process should such a neural network carry out so that the 𝑁-dimensional vector captures the semantic meaning of a word or a document?
• The idea introduced by scientists from Google in 2013 is to pick a word as the target (center word) and let the neural network take its context words (the surrounding words) as the input and predict the target. Just like the game where you have an incomplete sentence "I ... to school" and your machine tries to guess the word "go".
• It sounds pretty reasonable so far, right? Now the only question left is: what kind of initial embedding will we use as the input (see the sketch below)?
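A standard choice in word2vec-style models is a one-hot vector: the linear map 𝑣 = 𝐴𝑢 from the earlier bullet then simply selects one column of 𝐴, which becomes the word's embedding. A minimal sketch (the sizes below are arbitrary, for illustration only):

import numpy as np

V, N = 10, 4                          # vocabulary size and embedding (hidden) size
rng = np.random.default_rng(0)
A = rng.normal(size=(N, V))           # embedding matrix, one column per word

u = np.zeros(V)                       # one-hot input vector for word index 7
u[7] = 1.0

v = A @ u                             # the "dimension reduction" v = A u
print(np.allclose(v, A[:, 7]))        # True: multiplying by a one-hot just selects a column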
The output layer of such a network typically uses the softmax function over the vocabulary,

    \sigma_i(w) = \frac{e^{w_i}}{\sum_{j=1}^{V} e^{w_j}},

whose partial derivatives have the convenient closed form

    \frac{\partial \sigma_i}{\partial w_j} = \sigma_i (\delta_{ij} - \sigma_j).    (3.10)
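The softmax and the Jacobian formula (3.10) can be checked numerically with a short sketch (the input vector is random, purely for the check):

import numpy as np

def softmax(w):
    e = np.exp(w - w.max())        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
w = rng.normal(size=5)
s = softmax(w)

# analytic Jacobian from (3.10): d sigma_i / d w_j = sigma_i * (delta_ij - sigma_j)
jac = np.diag(s) - np.outer(s, s)

# finite-difference approximation of the same Jacobian
eps = 1e-6
num = np.stack([(softmax(w + eps * np.eye(5)[j]) - softmax(w - eps * np.eye(5)[j])) / (2 * eps)
                for j in range(5)], axis=1)

print(np.allclose(jac, num, atol=1e-6))  # True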
Lecture 4 – Part-Of-Speech and Named-Entity-Recognition
• determine the transition probability P(𝑡_𝑗 | 𝑡_𝑖), i.e. the probability that the tag 𝑡_𝑗 appears right after the tag 𝑡_𝑖:

    P(t_j \mid t_i) = \frac{C(t_i, t_j) + \alpha}{\sum_{k=1}^{N} C(t_i, t_k) + \alpha N} = \frac{C(t_i, t_j) + \alpha}{C(t_i) + \alpha N}    (4.1)

  where 𝛼 > 0 is the smoothing constant, 𝐶(𝑡_𝑖, 𝑡_𝑗) is the number of times the tag 𝑡_𝑗 follows the tag 𝑡_𝑖, and 𝐶(𝑡_𝑖) is the number of occurrences of the tag 𝑡_𝑖 in the training corpus.
• To compute the initial probabilities, we add the start-of-sentence symbol <s> at the beginning of each sentence.
Emission matrix
• determine the emission probability P(𝑤_𝑗 | 𝑡_𝑖), i.e. the probability that the tag 𝑡_𝑖 emits the word 𝑤_𝑗:

    P(w_j \mid t_i) = \frac{C(t_i, w_j) + \beta}{\sum_{w \in V} C(t_i, w) + \beta |V|} = \frac{C(t_i, w_j) + \beta}{C(t_i) + \beta |V|}    (4.2)
where 𝐶(𝑡 𝑖 , 𝑤 𝑗 ) is the number of times that the word 𝑤 𝑗 is associated with the
tag 𝑡 𝑖 in the training corpus.
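To make (4.1) and (4.2) concrete, here is a minimal sketch that builds smoothed transition and emission estimates from a tiny hand-tagged corpus; the toy sentences, tag set and smoothing values are all illustrative assumptions, not from the text.

from collections import Counter

# tiny hand-tagged corpus: lists of (word, tag) pairs, with a start-of-sentence tag <s>
tagged_sents = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")],
                [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")]]

alpha, beta = 0.001, 0.001
trans_counts, emis_counts, tag_counts = Counter(), Counter(), Counter()

for sent in tagged_sents:
    prev = "<s>"
    tag_counts[prev] += 1
    for word, tag in sent:
        trans_counts[(prev, tag)] += 1
        emis_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

tags = sorted(tag_counts)                           # tag set, including <s>
vocab = {w for sent in tagged_sents for w, _ in sent}

def trans_prob(t_prev, t):   # equation (4.1)
    return (trans_counts[(t_prev, t)] + alpha) / (tag_counts[t_prev] + alpha * len(tags))

def emis_prob(tag, word):    # equation (4.2)
    return (emis_counts[(tag, word)] + beta) / (tag_counts[tag] + beta * len(vocab))

print(round(trans_prob("DT", "NN"), 4))   # close to 1: NN always follows DT here
print(round(emis_prob("NN", "dog"), 4))   # about 0.5: NN emits "dog" half the time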
Appendices – Python cheatsheets
5.1 cheatsheet
References