
NLP Specialization

Fundamental concepts in NLP with practical Python illustrations and an engineering approach

Advanced developments of NLP with a focus on applications in Finance

TRINH TUAN PHONG

Last updated: October 11, 2024

Contents

Lecture 1 – Sentiment Analysis as your very first NLP application
  1.1 Introduction
  1.2 Linguistic intuition and first mathematical formulation
  1.3 Implementation with Logistic Regression Model
  1.4 Another approach and second mathematical formulation
  1.5 Implementation with Naive Bayes Model

Lecture 2 – N-gram Language Model
  2.1 Introduction
  2.2 Mathematical formulation for n-gram models
  2.3 Zero probability problem and smoothing techniques
    2.3.1 Handling OOV words
    2.3.2 Smoothing

Lecture 3 – Semantic meaning and word embedding
  3.1 Introduction
  3.2 Co-occurrence matrices
  3.3 Term frequency-Inverse Document Frequency
  3.4 Pointwise Mutual Information
    3.4.1 Entropy
    3.4.2 Conditional Entropy
    3.4.3 Mutual Information
  3.5 Word2vec
    3.5.1 A simple neural network with two hidden layers
    3.5.2 Continuous Bag of Words: idea and implementation

Lecture 4 – Part-Of-Speech and Named-Entity-Recognition
  4.1 Part-Of-Speech and Named-Entity-Recognition
    4.1.1 Markov Chain
    4.1.2 Hidden Markov Model for POS application
    4.1.3 Viterbi algorithm

Appendices – Python cheatsheets
  5.1 cheatsheet

Lecture 1 – Sentiment Analysis as your very first NLP application

1.1 Introduction

1.2 Linguistic intuition and first mathematical formulation

Let S be the set of all English sentences (the string space). Let g : S → {0, 1} be the sentiment function, which assigns the value 0 (negative sentiment) or 1 (positive sentiment) to a given sentence s.
Our ultimate goal is to establish a formula for g(s) for an arbitrary sentence s. The first question to ask is: where should we start?
A first idea that comes to mind is that, instead of trying to pin down a "singular" function taking only the discrete values 0 and 1, we determine a "smoother" function ḡ : S → [0, 1] which computes the probability that a sentence is sentimentally positive. Then we can apply a decision rule such as: if ḡ(s) surpasses the threshold 0.5, the sentiment is positive, i.e. g(s) = 1.
First linguistic intuition and first assumption: each word of a natural language has its own sentiment, and the sentiment of a sentence can be determined from the sentiments of all the words in that sentence:

    g(s) = \sum_{w \in s} g(w).    (1.1)
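
As a toy illustration of (1.1), the following sketch scores a sentence by summing per-word sentiment scores. The tiny lexicon and the threshold rule are purely hypothetical; a real system learns such values from data, as in the implementation below.

# Toy illustration of equation (1.1): the sentence score is the sum of
# per-word sentiment scores. The lexicon below is made up for illustration.
word_sentiment = {"great": 1.0, "love": 0.8, "boring": -0.9, "movie": 0.0}

def sentence_score(sentence):
    # Words missing from the lexicon contribute 0 (neutral).
    return sum(word_sentiment.get(w, 0.0) for w in sentence.lower().split())

def predict(sentence):
    # Threshold rule: a positive total score means g(s) = 1.
    return 1 if sentence_score(sentence) > 0 else 0

print(predict("I love this great movie"))  # 1 (positive)
print(predict("What a boring movie"))      # 0 (negative)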


1.3 Implementation with Logistic Regression Model

1.4 Another approach and second mathematical formulation

1.5 Implementation with Naive Bayes Model

import numpy as np
from auxiliary_functions import sigmoid, process_tweet, gradient_descent_logistic
import nltk
from nltk.corpus import twitter_samples
import pickle
import streamlit as st


def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its frequency
    """
    yslist = np.squeeze(ys).tolist()
    # start with an empty dict and populate it by looping over all tweets
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs

def extract_features(tweet, freqs, process_tweet=process_tweet):
    """
    Input:
        tweet: a list of words for one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1, 3)
    """
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    # bias term is set to 1
    x[0, 0] = 1

    # loop through each word in the list of words
    for word in word_l:
        # increment the word count for the positive label 1
        if (word, 1) in freqs:
            x[0, 1] += freqs[(word, 1)]
        # increment the word count for the negative label 0
        if (word, 0) in freqs:
            x[0, 2] += freqs[(word, 0)]

    assert x.shape == (1, 3)

    return x

def tweet_sentiment_analysis_runner():
    # get the train set and test set from the NLTK twitter samples
    # nltk.download('twitter_samples')
    # nltk.download('stopwords')
    all_positive_tweets = twitter_samples.strings('positive_tweets.json')
    all_negative_tweets = twitter_samples.strings('negative_tweets.json')

    # split data into train and test set
    test_pos = all_positive_tweets[4000:]
    train_pos = all_positive_tweets[:4000]
    test_neg = all_negative_tweets[4000:]
    train_neg = all_negative_tweets[:4000]
    train_x = train_pos + train_neg
    test_x = test_pos + test_neg

    # create the numpy arrays of positive and negative labels
    train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
    test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

    # create frequency dictionary
    freqs = build_freqs(train_x, train_y)

    # collect the features 'x' and stack them into a matrix 'X'
    X = np.zeros((len(train_x), 3))
    for i in range(len(train_x)):
        X[i, :] = extract_features(train_x[i], freqs)

    # training labels corresponding to X
    Y = train_y

    # apply gradient descent
    loss, weights = gradient_descent_logistic(X, Y, np.zeros((3, 1)), 1e-9, 10000)
    print(f"The cost after training is {loss:.8f}.")
    print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(weights)]}")

    # save the model parameters
    # np.savez('logistic_result.npz', array1=J, array2=w, array3=freqs)
    save_data = {
        'model': 'logistic',
        'weights': weights,
        'loss_history': loss,
        'frequencies': freqs}

    with open('logistic_regression_model.pkl', 'wb') as file:
        pickle.dump(save_data, file)

    print("Logistic regression parameters saved successfully.")

    return loss, weights, freqs

def predict_tweet(tweet, freqs, theta):
    """
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3, 1) vector of weights
    Output:
        y_pred: the probability of the tweet being positive
    """
    # extract the features of the tweet and store them in x
    x = extract_features(tweet, freqs)
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))

    return y_pred

def streamlit_sentiment_analysis():
    st.title("Sentiment Analysis for Tweets")
    tweet = st.text_input("Enter a tweet:")

    rerun_option = st.selectbox("Do you want to rerun the model?", ["Yes", "No"])
    if rerun_option == "Yes":
        loss, weights, freqs = tweet_sentiment_analysis_runner()
    else:
        with open('logistic_regression_model.pkl', 'rb') as file:
            logistic_params = pickle.load(file)
        weights = logistic_params['weights']
        loss = logistic_params['loss_history']
        freqs = logistic_params['frequencies']

    st.write(f"The cost after training is {loss:.8f}.")
    st.write(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(weights)]}")

    y_pred = predict_tweet(tweet, freqs, weights)
    st.write(f"The prediction is {y_pred}")
    if np.squeeze(y_pred) > 0.5:
        st.write("Positive sentiment")
    else:
        st.write("Negative sentiment")

    # # File uploader for a .txt file
    # uploaded_file = st.file_uploader("group_divided.txt", type="txt")
    # if uploaded_file is not None:
    #     # Read the .txt file content
    #     text = uploaded_file.read().decode("utf-8")
    #     # Display the content of the file
    #     st.write("File Content:")
    #     st.text_area("", text, height=300)


if __name__ == "__main__":
    print("Run the Sentiment Analysis solution for tweets!")

    # loss, weights, freqs = tweet_sentiment_analysis_runner()
    # # Load the saved model and objects
    # with open('logistic_regression_model.pkl', 'rb') as file:
    #     logistic_params = pickle.load(file)
    #
    # # Extract the objects from the dictionary
    # model_name = logistic_params['model']
    # weights = logistic_params['weights']
    # loss = logistic_params['loss_history']
    # freqs = logistic_params['frequencies']
    #
    # print("Model and objects loaded successfully.")
    # print("Model name: ", model_name)
    # print(f"Weights: {weights}")
    # print(f"Loss History: {loss}")
    # print(f"Frequencies: {freqs}")
    # print("Finish the run!")

    streamlit_sentiment_analysis()

Lecture 2 – N-gram Model as the very first Language Model

2.1 Introduction

Let's start this chapter by talking a bit about how we learned our mother tongue as children.
A child's journey to learn to speak and acquire their mother tongue is a fascinating and complex process. After passing through the pre-linguistic phase, with a lot of babbling sounds during the very first year, a child often begins to say their first single words like "mum", "dad", etc. Around 18-24 months, children start combining two words to form simple sentences ("more juice" for instance); we call this the two-word stage. Then they enter the Telegraphic Speech phase and start to form simple sentences, usually with 3-4 words. Fascinatingly enough, they also begin to understand and use a bit of their mother tongue's grammar.
That is how a child naturally understands and uses a language (to communicate with people), and it reflects the nature of language itself. Language always has some sort of ordering structure, due to standard grammar rules, community style (American English differs in style from British English, for example), and so on. Hence, naturally, certain sequences of words appear more often than others.


Mathematically speaking, they appear with a higher probability than others.
Such a natural observation is a good starting point for comprehending Language Models (LMs).
A language model is a (Machine Learning) model that assigns a probability distribution to the occurrence of sequences of symbols in a language. The symbols may be letters, syllables or words of a natural language. For the sake of simplicity, we only consider words as symbols throughout this chapter.

• An n-gram is a sequence of n adjacent symbols in a certain order.

• An n-gram model is a model that predicts n-gram sequences. As an abuse of language, we use the term "n-gram" for either the sequence or the predictive model.

• The n-gram model can be seen as the most basic language model; it can generate text in the form of successive n-gram predictions.

2.2 Mathematical formulation for n-gram models

Assume that we want to compute the occurrence probability P(w_1 w_2 ... w_n) of a sequence of n words w_1, w_2, ..., w_n in a certain corpus. We will use the expression w_{1:n} as a shorthand for this sequence. Now, by the chain rule for conditional probability, we have

    P(w_{1:n}) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{1:n-1}) = \prod_{i=1}^{n} P(w_i | w_{1:i-1}).    (2.1)

From (2.1), the computation reduces to determining, for each i, the occurrence probability of a single word w_i given the whole history, i.e. the i-1 previous words. Coming up with an explicit formula for this conditional probability seems to be an impossible mission. This is where we need some extra assumptions (hypotheses) to simplify the object we need to compute.
Markov assumption: the occurrence of w_i depends only on the N-1 preceding words, i.e. P(w_i | w_{1:i-1}) = P(w_i | w_{i-N+1:i-1}). For the unigram model, i.e. N = 1, P(w_i | w_{1:i-1}) = P(w_i), the occurrence probability of the word w_i itself; in the unigram world, all words are independent of each other. For the bigram model (N = 2), P(w_i | w_{1:i-1}) = P(w_i | w_{i-1}), i.e. the occurrence of the word w_i depends only on the preceding word w_{i-1}.


Let's consider the bigram model. A natural candidate formula for P(w_i | w_{i-1}) is

    P(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i)}{\sum_{w \in V} C(w_{i-1} w)} = \frac{C(w_{i-1} w_i)}{C(w_{i-1})},    (2.2)

where C(x) is the counting function of the word sequence x with respect to some training corpus V, i.e. C(w_{i-1} w_i) is the number of occurrences of the bigram w_{i-1} w_i in V and C(w_i) is the number of occurrences of the unigram w_i in V.
Hence, in the bigram case,

    P(w_{1:n}) = \prod_{i=1}^{n} P(w_i | w_{i-1}) = \prod_{i=1}^{n} \frac{C(w_{i-1} w_i)}{\sum_{w \in V} C(w_{i-1} w)} = \prod_{i=1}^{n} \frac{C(w_{i-1} w_i)}{C(w_{i-1})}.    (2.3)

A more general formula for the N-gram model is obtained in the same way:

    P(w_i | w_{1:i-1}) = \frac{C(w_{i-N+1:i-1} w_i)}{C(w_{i-N+1:i-1})} = \frac{C(w_{i-N+1:i})}{C(w_{i-N+1:i-1})},    (2.4)

which is the ratio between the number of occurrences of the N-gram w_{i-N+1:i} and that of the (N-1)-gram w_{i-N+1:i-1}.
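
The following minimal sketch estimates the bigram probabilities of (2.2) by counting, assuming a tiny toy corpus that is already tokenized and padded with the sentence boundary symbols discussed in the next paragraphs.

from collections import Counter

# Minimal sketch of the bigram estimate (2.2) on a tiny toy corpus.
corpus = [
    ["<s>", "i", "go", "to", "school", "</s>"],
    ["<s>", "i", "go", "home", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("i", "go"))     # 1.0: "go" always follows "i" in this corpus
print(bigram_prob("go", "to"))    # 0.5
print(bigram_prob("go", "home"))  # 0.5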
There are certain subtleties in the implementation of the count function that we should take into account in n-gram models. For the sake of simplicity, let's come back to the bigram case.
First of all, we observe that the first word of a sentence has nothing to condition on. It is therefore convenient to add a special start-of-sentence symbol <s> to each sentence and then compute all bigrams. For the N-gram model, we need to add N - 1 copies of <s>. Next, recall what we said about Language Models: they assign a probability distribution to the set of sentences, which means the probabilities of all sentences must sum to 1. Let's check whether this condition holds for our formula by considering the following corpus V built on the two letters {a, b}:

    <s> a b
    <s> a a
    <s> b a
    <s> b b

On the one hand, it is easy to see that the unigram count C(a) is equal to 4. On the other hand, \sum_{w \in V} C(a w) is actually equal to 2 (an "a" at the end of a sentence starts no bigram). Hence, the second equality in (2.2) and (2.3) does not hold exactly.


Moreover, we will show that the probabilities of all 2-word sentences over {a, b} sum to 1. Indeed,

    P(<s> a a) = P(a | <s>) P(a | a) = \frac{C(<s> a)}{\sum_{w \in V} C(<s> w)} \cdot \frac{C(a a)}{\sum_{w \in V} C(a w)} = \frac{2}{4} \cdot \frac{1}{2} = 1/4.

The same result holds for P(<s> b b), P(<s> a b) and P(<s> b a).
By a similar computation, we can prove that the probabilities of all possible k-word sentences over the alphabet {a, b} sum to 1, for any k ≥ 2. However, as far as LMs are concerned, we need the probabilities of all sequences, of every length, to sum to 1, not the probabilities at each fixed length separately.
This is where we need another special symbol, the end-of-sentence marker </s>, for each sentence in the training corpus. Our readers are encouraged to check that, on the modified corpus V_bis

    <s> a b </s>
    <s> b b </s>
    <s> b a </s>
    <s> a a </s>

the bigram (or any n-gram) model assigns a proper probability distribution.
To sum up, if our explanation above was convincing enough, every sentence in any training corpus should be formatted as <s> sentence </s>, and then we can be confident that our LM functions as it should.
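
As a quick numerical check of the toy example above, the sketch below reproduces the computation on the corpus without </s> and confirms that the probability mass at length two is already 1, hence the need for the end-of-sentence symbol.

from collections import Counter
from itertools import product

# Worked check: without </s>, the bigram model puts mass 1 on two-word sentences alone.
corpus = [["<s>", "a", "b"], ["<s>", "a", "a"], ["<s>", "b", "a"], ["<s>", "b", "b"]]
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def p(word, prev):
    # P(word | prev) with the sum-over-bigrams denominator used in the text.
    total = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
    return bigrams[(prev, word)] / total

total_mass = 0.0
for w1, w2 in product("ab", repeat=2):
    prob = p(w1, "<s>") * p(w2, w1)
    print(f"P(<s> {w1} {w2}) = {prob}")  # each equals 0.25
    total_mass += prob
print(total_mass)  # 1.0 already at length two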

2.3 Zero probability problem and smoothing techniques

Taking a closer look at our prediction function for P(w_1 w_2 ... w_n), there are two serious problems that we should deal with:

• Out-of-vocabulary (OOV) words. If, in the test set, we encounter a word which does not appear at all in the training corpus, then its counts are 0 and the conditional probabilities in (2.2) become undefined.

• Since our prediction function is a product of many terms, we do not want a single zero term to kill the whole product. This situation occurs when a new association w_{i-1} w_i appears in the test set: since this bigram does not appear in the training corpus, the corresponding term becomes 0 and so does the predicted probability.

2.3.1 Handling OOV words

In the prediction phase, given a sentence, we can mark any new word (one that does not appear in the training corpus) by converting it into a special symbol <UNK> (standing for UNKNOWN). For example, consider the sentence "TRINH TUAN PHONG is teaching the NLP specialization in 2024". It is very likely that the model has absolutely no idea who I am, so the sentence will be converted into "<UNK> <UNK> <UNK> is teaching the NLP specialization in 2024".
Now a natural question arises: what is the occurrence probability of the word <UNK>? Since <UNK> is just an artificial symbol that we have introduced, we have to incorporate <UNK> into the training corpus in some way and compute its occurrence probability. There are two common ways to do so.

• The first is to choose a fixed vocabulary V in advance. We then transform any word in the training corpus that does not belong to V into <UNK>, and count as usual to determine the probability of <UNK>.

• In case we do not have a fixed vocabulary at hand, another way to assign a probability to <UNK> is to fix a minimum frequency m. All words that occur less frequently than m are converted into <UNK>.
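
A minimal sketch of the second (minimum-frequency) strategy, assuming the training corpus is already tokenized:

from collections import Counter

# Words occurring fewer than min_count times in the corpus are replaced by <UNK>.
def unkify(sentences, min_count=2):
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else "<UNK>" for w in sent] for sent in sentences]

train = [["i", "go", "to", "school"], ["i", "go", "home"]]
print(unkify(train))
# [['i', 'go', '<UNK>', '<UNK>'], ['i', 'go', '<UNK>']]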

2.3.2 Smoothing

To handle the zero-factor problem, we make use of various smoothing techniques. Probably the most familiar to many of us is the Laplace (add-one) smoothing technique.
Assume that the training corpus has N tokens in total, with vocabulary V (the set of unique words in the training corpus). Then, in the unigram model, the probability of the word w_i is

    P(w_i) = \frac{C(w_i)}{N}.

Applying Laplace smoothing means that we simply increment the count of each word in the vocabulary by 1. Hence,

    P_{Laplace}(w_i) = \frac{C(w_i) + 1}{N + |V|}.    (2.5)


Note that (2.5) assigns a probability of 1/(N + |V|) to any unseen word in the unigram model.
Similarly, in the bigram model,

    P_{Laplace}(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{\sum_{w \in V} (C(w_{i-1} w) + 1)} = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}.    (2.6)

If we apply add-k smoothing with any k > 0, (2.6) becomes

    P_{add-k}(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{\sum_{w \in V} (C(w_{i-1} w) + k)} = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + k|V|}.    (2.7)
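
A sketch of (2.7) on a toy corpus; note how the unseen bigram ("go", "school") now receives a small non-zero probability instead of killing the whole product.

from collections import Counter

# Sketch of add-k smoothing for bigram probabilities, equation (2.7).
def smoothed_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    # (C(prev word) + k) / (C(prev) + k * |V|)
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

corpus = [["<s>", "i", "go", "home", "</s>"], ["<s>", "i", "go", "to", "school", "</s>"]]
unigram_counts = Counter(w for s in corpus for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

print(smoothed_bigram_prob("go", "school", bigram_counts, unigram_counts, len(unigram_counts)))
print(smoothed_bigram_prob("go", "home", bigram_counts, unigram_counts, len(unigram_counts)))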

Lecture 3 – Vector Embedding

3.1 Introduction

In the previous chapters we discussed the Sentiment Analysis and Auto-Completion tasks. To solve the Sentiment Analysis problem, instead of following the complex, human-brain way of comprehending natural language, we simply stuck to the idea that sentiment can be classified, and its intensity estimated, from the frequencies of words belonging to each class. With the n-gram model in the Auto-Completion task, we came a little closer to Natural Language Understanding (NLU) by trying to crack one small aspect of natural languages: the ordering and structure of word sequences. However, this is still only the tip of the iceberg.
Let's take a moment to discuss why teaching a machine to understand natural language is challenging. Is it because of the grammar, the vocabulary or something else? There are many things to take into consideration here. I think we all agree that the ultimate goal of NLU is to understand the meaning of words and sentences. However, the most problematic challenge is dealing with the fact that a word can have multiple senses (polysemy). You can only understand a word sense when you put it into a specific context, i.e. into one or several specific sentences. Linguistically speaking, if you see a word, say "bank", what you see is the citation form (lemma) of this word. It can have the sense of a financial institution, or it can indicate the sloping land alongside a river, as in "river bank".
Here comes a brilliant observation made by various linguists in the 1950s: words whose neighbors are similar tend to have similar meanings. "You shall know a word by the company it keeps", as the linguist Firth (1957) put it. This observation leads to the idea of analyzing how words co-occur, or how they are distributed in our language, in order to capture the meaning of words.
Another great idea from linguists such as Osgood et al. in 1957 is to determine a word's connotation via certain factors.

• The connotation, or affective meaning, of a word can be seen as the emotional flavor of the word when you hear or read it. For example, whenever I hear the word "awesome", it provokes some excitement and a young, positive feeling. On the contrary, a word like "stressful" easily gives you a negative sentiment.

• According to these linguists, a word can be represented as a real-valued 3-dimensional vector with the following three components: valence (the pleasantness), arousal (the emotional intensity) and dominance (the degree of control).

• The idea of representing a word sense as a point in R^d for a certain d is revolutionary and can be seen as the origin of the modern NLP concept of embedding that is so widely used nowadays.

The following is our tentative definition of the embedding concept: word embedding is a technique that represents words as vectors in R^d, for some dimension d, so that the embedded vectors of "similar words" stay close to each other with respect to certain metrics on R^d.
The word "similarity" appears everywhere in Data Science, but we would like to emphasize that similarity is about as loose a term as possible in NLP. Linguistically speaking, you can say that cat and dog are similar words since they are both domestic animals that we humans generally love. Paris and Rome are similar since both of them are cities in European countries. That being said, almost any pair of words might be similar in some regard. As fancy as it may sound, the infinitely many possible notions of similarity in linguistics turn out to be a big challenge for machines trying to comprehend our natural languages. That is why transforming the similarity perspective of natural language into the mathematical concept of "closeness" between points in some space is a good choice: it makes handling similarity mathematically and computationally feasible.
Popular applications of vector embeddings are Machine Translation, Information Retrieval and Question Answering.
There exist two kinds of embedding: static and contextual. Static embedding means that each word of the vocabulary is assigned a unique embedding vector. In contextual embedding, a word has different embeddings in different contexts.

In this chapter, we introduce three static embedding methods: term frequency - inverse document frequency (tf-idf), pointwise mutual information (PMI) and Word2vec. There will be much more to come on contextual embedding in the coming chapters.
This chapter is a very nice illustration of the beautiful interaction between Machine Learning algorithms, Information Theory from Computer Science and research from Linguistics.

Figure 1: embedding types

3.2 Co-occurrence matrices

One natural way to analyze how words co-occur in a language is to represent the co-occurrence counts in a matrix, the so-called co-occurrence matrix.
Assume that we are in a typical Information Retrieval (IR) situation: we have a collection of 4 documents, Science, Politics, Finance and Economy, and we would like to build a search engine for users on this collection.
Documents: d_1, d_2, ..., d_k. Words: w_1, w_2, ..., w_{|V|}. The co-occurrence matrix is A = (a_{ij}) = (count(w_i, d_j)) ∈ M(|V|, k).
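
The sketch below builds such a term-document matrix A for four tiny made-up documents (the document texts are purely illustrative).

import numpy as np

# Term-document co-occurrence matrix A with a_ij = count(w_i, d_j).
docs = {
    "Science":  "the experiment results confirm the theory",
    "Politics": "the election results surprised the party",
    "Finance":  "the bank raised the interest rate",
    "Economy":  "the interest rate affects the economy",
}

tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({w for words in tokenized.values() for w in words})
word_index = {w: i for i, w in enumerate(vocab)}

A = np.zeros((len(vocab), len(docs)), dtype=int)
for j, words in enumerate(tokenized.values()):
    for w in words:
        A[word_index[w], j] += 1

print(A[word_index["the"]])       # [2 2 2 2]: frequent everywhere, hence uninformative
print(A[word_index["interest"]])  # [0 0 1 1]: concentrated in Finance and Economy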

3.3 Term frequency-Inverse Document Frequency

In Section 3.2, we use the raw frequency as the entries of the co-occurrence matrix. However, this quantity has one serious flaw:

• Consider again the document-searching problem. Popular words might appear millions of times in each document, but they are not helpful at all for distinguishing one document from the rest of the collection (just like stopwords).

To remedy the situation, we scale the raw frequency count(w_i, d_j) by a factor based on the inverse of the fraction of documents that contain the word w_i. Then popular words that appear throughout the collection have their weight scaled down, while rare, characteristic words have it scaled up.
Precisely, we first modify the raw frequency as follows:

    tf(w_i, d_j) = term frequency(w_i, d_j) = \log_{10}(1 + count(w_i, d_j)).    (3.1)

Here we take the log to dampen the effect of very large counts (and, as a good routine, to avoid numerical issues such as overflow and precision loss). We add 1 to handle the case where the count is equal to 0.
Next, let N be the number of documents in the collection and let

    df(w_i) = number of documents that contain the word w_i,    (3.2)

so that df(w_i)/N is the document frequency of the word w_i in the collection. Then we can define the Inverse Document Frequency (idf) as follows:

    idf(w_i) = idf(w_i, collection_size) = \log_{10}\left(\frac{N}{df(w_i)}\right).    (3.3)

The quantity idf(w_i) varies from 0 to +∞: it takes values close to 0 for undiscriminative (unimportant) words, while the values for discriminative (important) words are large.
Hence,

    tf_idf(w_i, d_j) = tf_idf(w_i, d_j, collection_size) = tf(w_i, d_j) \times idf(w_i).    (3.4)

Our readers are encouraged to replace the raw frequency by the above formulation and redo the computation in Section 3.2.
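
A self-contained sketch of (3.1)-(3.4); the 4 x 3 count matrix is made up, with one stopword-like row that occurs in every document.

import numpy as np

# Sketch of tf-idf weighting, equations (3.1)-(3.4). Rows are words, columns documents.
A = np.array([
    [2, 2, 2],   # a stopword-like term appearing in every document
    [3, 0, 0],   # a term concentrated in document 0
    [0, 1, 0],
    [0, 0, 4],
])

tf = np.log10(1 + A)                  # (3.1): dampened term frequency
df = np.count_nonzero(A, axis=1)      # (3.2): document frequency of each word
idf = np.log10(A.shape[1] / df)       # (3.3): inverse document frequency
W = tf * idf[:, None]                 # (3.4): tf-idf weights

print(np.round(W, 3))
# The first row is all zeros (idf = 0): the ubiquitous term gets no weight.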

3.4 Pointwise Mutual Information

Let’s consider another approach for co-occurrence matrix. Assume a word by


word design. What we are interested in is to estimate the probability that a
pair of words (𝑥, 𝑦) co-occurs in our training corpus. The worst scenario is
that the appearence of one word has nothing to do with the other. In this
case, (𝑥, 𝑦) are realizations of two independent random variables (𝑋 , 𝑌) i.e.
P(𝑥, 𝑦) := P(𝑋 = 𝑥, 𝑌 = 𝑦) = P(𝑋 = 𝑥)P(𝑌 = 𝑦) =: P(𝑥)P(𝑦).

15
Short title of document 3.4 Pointwise Mutual Information

Then, to measure how often two words occur together in the same environment
(neighborhood), we can consider the following so-called Mutual Information
(MI):
P(𝑥, 𝑦)
 
𝑀𝐼(𝑥, 𝑦) = log2 (3.5)
P(𝑥)P(𝑦)
which compares the difference between the joint-occurrence and the indepen-
dent one. In other words, Mutual Information measures how much one random
variable tells us about another i.e. how much the appearence of one word pre-
dicts that of, let’s say, the next word.
Mutual Information is a very interesting concept of Information Theory that de-
serves a little more explanation. It show a nice relation between KullBack-Leibler
divergence, Entropy and Mutual Information.
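
A minimal sketch estimating (3.5) from pair counts; the counts below are invented purely to show one positive and one negative PMI value.

import math
from collections import Counter

# Sketch of PMI, equation (3.5), estimated from (word, next_word) pair counts.
pairs = ([("new", "york")] * 8 + [("new", "idea")] * 2
         + [("old", "york")] * 1 + [("old", "idea")] * 9)

pair_counts = Counter(pairs)
left_counts = Counter(x for x, _ in pairs)
right_counts = Counter(y for _, y in pairs)
total = len(pairs)

def pmi(x, y):
    p_xy = pair_counts[(x, y)] / total
    return math.log2(p_xy / ((left_counts[x] / total) * (right_counts[y] / total)))

print(round(pmi("new", "york"), 3))  # about 0.83: co-occurs more often than chance
print(round(pmi("new", "idea"), 3))  # about -1.46: co-occurs less often than chance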

3.4.1 Entropy

Definition 3.1. Let X be a discrete random variable that takes values in 𝒳 and let p(x) := P(X = x). The entropy H(X) measures the degree of uncertainty or randomness carried by X.
Precisely,

• Let I(X) := -\log_b(p(X)) = \log_b(1/p(X)) be the information content of X for a certain base b. For a rare event X = x with small p(x), I(X = x) is a huge value. For a very common event which happens almost all the time, i.e. p(x) close to 1, I(X = x) tends to 0. That means a commonplace piece of information contributes little to the degree of uncertainty, from the Information Theory perspective.

• Entropy is defined as the expected amount of information that the random variable X holds:

    H(X) := E(I(X)) = E(-\log_b(p(X))) = -\sum_{x \in 𝒳} p(x) \log_b(p(x)).    (3.6)

• There are three common choices for the base b: b = 2 gives units of bits (shannons); base 10 gives units of dits; base e gives natural units, nats. Throughout the chapter, we omit the subscript b whenever we use base 2.

Definition 3.2. For a pair of random variables (X, Y), we can naturally define the so-called joint entropy H(X, Y) as follows:

    H(X, Y) = -\sum_{(x, y) \in 𝒳 \times 𝒴} p(x, y) \log_b(p(x, y)) = -\sum_{x \in 𝒳} \sum_{y \in 𝒴} p(x, y) \log_b(p(x, y)),

where p(x, y) = P(X = x, Y = y).

It is not too hard to see that the joint entropy cannot surpass the sum of the two marginal (individual) entropies, i.e. H(X, Y) ≤ H(X) + H(Y).

Example. Assume that you want to send an encrypted message using a sequence of the 4 characters 'A', 'B', 'C' and 'D' over a binary channel. How many bits do you need to transmit the message?

• If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter (the entropy is equal to 2 bits). 'A' might be coded as '00', 'B' as '01', 'C' as '10' and 'D' as '11'.

• However, if the probabilities of the letters are unequal, say 'A' occurs with 70% probability, 'B' with 26%, and 'C' and 'D' with 2% each, one can assign variable-length codes: 'A' coded as '0', 'B' as '10', 'C' as '110' and 'D' as '111'. With this representation, 70% of the time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time three bits. On average, fewer than 2 bits are required, since the entropy is lower.

Example. A fair coin is the situation of maximum uncertainty: its entropy, 1 bit, is the largest possible for a binary random variable.

Example. English text, treated as a string of characters, has fairly low entropy;
i.e. it is fairly predictable. After the first few letters one can often guess the rest
of the word. English text has between 0.6 and 1.3 bits of entropy per character
of the message.
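
As a quick numerical check of the coding example above, the following sketch compares the entropy of the skewed distribution with the average length of the variable-length code.

import math

# Entropy of the skewed A/B/C/D distribution vs. the average code length.
probs = {"A": 0.70, "B": 0.26, "C": 0.02, "D": 0.02}
code_lengths = {"A": 1, "B": 2, "C": 3, "D": 3}

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_code_length = sum(probs[s] * code_lengths[s] for s in probs)

print(round(entropy, 3))          # about 1.09 bits per symbol
print(round(avg_code_length, 2))  # 1.34 bits per symbol, well below 2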

3.4.2 Conditional Entropy

Definition 3.3. Given two discrete random variables X and Y with support sets 𝒳 and 𝒴, the conditional entropy H(X|Y) is the quantity that measures the remaining randomness (uncertainty) in X given the information from Y. In other words, H(X|Y) measures how much uncertainty about X remains once Y is known:

    H(X|Y) := -\sum_{(x, y) \in 𝒳 \times 𝒴} p(x, y) \log_b\left(\frac{p(x, y)}{p(y)}\right),    (3.7)

where p(x, y) = P(X = x, Y = y) and p(y) = P(Y = y).

The following are some fundamental properties of the conditional entropy.

1. H(X|Y) = 0 (there is no uncertainty left) iff the value of X is completely determined by Y.


2. 𝐻(𝑋|𝑌) ≤ 𝐻(𝑋) for any 𝑋 , 𝑌. The equality holds iff 𝑋 and 𝑌 are two
independent variables.

3. 𝐻(𝑋|𝑌) = 𝐻(𝑋 , 𝑌) − 𝐻(𝑌) (chain rule).

4. 𝐻(𝑋|𝑌) = 𝐻(𝑌|𝑋) + 𝐻(𝑋) − 𝐻(𝑌) (Bayes rule).

3.4.3 Mutual Information

Definition 3.4. The Mutual Information MI(X, Y) of two random variables X and Y measures the mutual dependence of X and Y:

    MI(X, Y) := \sum_{(x, y) \in 𝒳 \times 𝒴} p(x, y) \log_b\left(\frac{p(x, y)}{p(x) p(y)}\right),    (3.8)

where p(x, y) = P(X = x, Y = y), p(x) = P(X = x) and p(y) = P(Y = y).

1. A nice application of Jensen's inequality yields MI(X, Y) ≥ 0. Equality holds iff X and Y are independent.

2. MI(X, Y) = H(X) - H(X|Y) for any X, Y. Together, these two properties imply that H(X|Y) ≤ H(X).

3. MI(X, Y) = MI(Y, X).

The second property gives a nice interpretation of the concept: MI(X, Y) is the reduction in the uncertainty of X obtained by conditioning on Y. Hence, Mutual Information measures how much knowing one random variable reduces the uncertainty of the other.
Moreover, it also implies that MI(X, Y) coincides with the Information Gain, a familiar concept used in Decision Trees, where X plays the role of the label and Y is a specific feature (we choose the feature which maximizes the Information Gain to split).
Since it compares a joint probability with the product of the marginals, the concept of Mutual Information can be linked naturally to the Kullback-Leibler divergence (the KL statistical distance):

    MI(X, Y) = D_{KL}\left(P_{(X,Y)} \,\|\, P_X \otimes P_Y\right),    (3.9)

where P_{(X,Y)} is the joint distribution and P_X ⊗ P_Y is the product (tensor) distribution which assigns the probability p(x) p(y) to each pair (x, y).


3.5 Word2vec

The (static) embeddings based on tf-idf and PPMI that we have seen in Sections 3.3 and 3.4 have certain critical flaws.

• The embedding vectors generated by these methods are sparse and live in a huge-dimensional space (the size of the vocabulary V). Hence, they are not really practical to use.

• The ability of these methods to capture the semantic meaning of a word or a document is questionable. We saw in our examples that the tf-idf embedding vectors for 4 Harry Potter books failed to distinguish those books from each other.

• word2vec is a kind of embedding in which words are converted into dense vectors of lower dimension. Such embeddings capture semantic meaning and various linguistic relations, such as analogy/relational similarity, rather well.

• word2vec was developed by researchers at Google in 2013 using the two following techniques: Continuous Bag of Words (CBOW) and Skip-gram with negative sampling (SGNS).

One brilliant idea to handle this problem is the following.

• Assume that we have such a huge, sparse embedding vector u of vocabulary size V. Mathematically speaking, we want to reduce the size of u by applying a linear transformation to it, obtaining v = Au where A ∈ M(N, V) with N much smaller than V. In Deep Learning language, this is equivalent to passing the vector through a layer with hidden size N.

• What learning process should such a neural network carry out so that this N-dimensional vector captures the semantic meaning of a word or a document?

• The idea introduced by the Google researchers in 2013 is to pick a word as the target (center word) and let the neural network take its context words (the surrounding words) as the input and predict the target. Just like the game where you have the incomplete sentence "I ... to school" and your machine tries to guess the word "go".


• It sounds pretty reasonable so far. Now the only question left is: what kind of initial embedding should we use as the input?

• The answer is as simple as it could be. Initially, we can represent a word of a given vocabulary as a one-hot encoding vector of size V. This vector has a single entry equal to 1, at the index of the word in the vocabulary. Precisely, we treat our training corpus as a Bag of Words (BOW). The supervised learning examples are pairs (context words, target word). The initial representation of the context is the average of the one-hot vectors of the context words. Those pairs are fed to a neural network for training, and the learned hidden weight matrix encodes all the word embeddings.

• This method was introduced as the CBOW technique for Word2Vec by researchers at Google in 2013; a minimal sketch of the forward pass follows below.
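
The following sketch shows one (untrained) CBOW forward pass under the setup described above. The toy vocabulary, the sizes and the random initialization are assumptions made purely for illustration, not the original implementation.

import numpy as np

# Minimal sketch of one CBOW forward pass on a toy vocabulary.
rng = np.random.default_rng(0)

vocab = ["i", "go", "to", "school", "home"]
word_index = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3             # vocabulary size and embedding (hidden) size

W_in = rng.normal(size=(N, V))   # input-side weights (one embedding per word)
W_out = rng.normal(size=(V, N))  # output-side weights

def one_hot(word):
    v = np.zeros(V)
    v[word_index[word]] = 1.0
    return v

def cbow_forward(context_words):
    x = np.mean([one_hot(w) for w in context_words], axis=0)  # average of one-hot vectors
    h = W_in @ x                                              # hidden (embedding) layer
    scores = W_out @ h                                        # one score per vocabulary word
    exp = np.exp(scores - scores.max())                       # softmax over the vocabulary
    return exp / exp.sum()

probs = cbow_forward(["i", "to", "school"])  # context around the missing word "go"
print(vocab[int(np.argmax(probs))])          # untrained weights, so the guess is arbitrary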

3.5.1 A simple neural network with two hidden layers

The softmax function is \sigma : R^V → R^V, (w_1, w_2, ..., w_V) ↦ (\sigma_1, \sigma_2, ..., \sigma_V), where

    \sigma_i(w) = \frac{e^{w_i}}{\sum_{j=1}^{V} e^{w_j}}.

By some elementary computation, it is easy to see that

    \frac{\partial \sigma_i}{\partial w_j} = \sigma_i (\delta_{ij} - \sigma_j),    (3.10)

where \delta_{ij} equals 1 if i = j and 0 otherwise (the Kronecker delta notation). In other words, the Jacobian matrix of \sigma has the following form:

    J(w) = \begin{pmatrix}
      \sigma_1(1 - \sigma_1) & -\sigma_1 \sigma_2 & \cdots & -\sigma_1 \sigma_V \\
      -\sigma_2 \sigma_1 & \sigma_2(1 - \sigma_2) & \cdots & -\sigma_2 \sigma_V \\
      \vdots & \vdots & \ddots & \vdots \\
      -\sigma_V \sigma_1 & -\sigma_V \sigma_2 & \cdots & \sigma_V(1 - \sigma_V)
    \end{pmatrix}.    (3.11)

import numpy as np

def softmax(z):
    # exponentiate and normalize; subtracting the max improves numerical stability
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)


3.5.2 Continuous Bag of Words: idea and implementation

Lecture 4 – Part-Of-Speech and Named Entity Recognition with Hidden Markov Model

4.1 Part-Of-Speech and Named-Entity-Recognition

Part-of-Speech (POS) tagging: find the POS tags for a given sequence of words. For example, if our available POS tags are ['NN', 'VB', 'O'] and the sequence is "I go to school", then the expected output is one tag per word, e.g. I/O go/VB to/O school/NN.

4.1.1 Markov Chain

4.1.2 Hidden Markov Model for POS application

Let 𝒯 = [t_1, t_2, ..., t_N] be the list of tags; think of the list ['NN', 'VB', 'O'] as a simple example. Let {w_i}_{i ∈ V} be the vocabulary.
A Hidden Markov Model (HMM) consists of two main parts: a transition matrix and an emission matrix. An HMM contains a number of states and the transition probabilities between those states. In the POS application, the states are our list of tags 𝒯; since the tags are not observed directly, they play the role of hidden states, while the words are the observations.
Transition matrix

• Count the occurrences of tag pairs C(t_i, t_j) for t_i, t_j ∈ 𝒯.

• Determine the transition probability P(t_j | t_i), i.e. the probability that the tag t_j appears right after t_i:

    P(t_j | t_i) = \frac{C(t_i, t_j) + \alpha}{\sum_{k=1}^{N} C(t_i, t_k) + \alpha N} = \frac{C(t_i, t_j) + \alpha}{C(t_i) + \alpha N},    (4.1)

  where α > 0 is a smoothing constant and C(t_i) is the number of occurrences of the tag t_i in the training corpus.

• To compute the initial probabilities, we add the start-of-sentence symbol <s> at the beginning of each sentence.

Emission matrix

    P(w_j | t_i) = \frac{C(t_i, w_j) + \beta}{\sum_{w \in V} (C(t_i, w) + \beta)} = \frac{C(t_i, w_j) + \beta}{C(t_i) + \beta|V|},    (4.2)

where 𝐶(𝑡 𝑖 , 𝑤 𝑗 ) is the number of times that the word 𝑤 𝑗 is associated with the
tag 𝑡 𝑖 in the training corpus.
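
A small sketch of how the two smoothed matrices (4.1)-(4.2) can be estimated from counts, assuming a tiny hand-tagged toy corpus (the sentences, tags and smoothing constants below are made up).

import numpy as np
from collections import Counter

# Toy (word, tag) corpus; <s> marks the start of each sentence as in the text.
tagged = [
    [("<s>", "<s>"), ("i", "O"), ("go", "VB"), ("home", "NN")],
    [("<s>", "<s>"), ("i", "O"), ("love", "VB"), ("school", "NN")],
]

tags = sorted({t for sent in tagged for _, t in sent})
words = sorted({w for sent in tagged for w, _ in sent})

trans_counts, trans_totals = Counter(), Counter()
emit_counts, tag_totals = Counter(), Counter()
for sent in tagged:
    for w, t in sent:
        emit_counts[(t, w)] += 1
        tag_totals[t] += 1
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        trans_counts[(t1, t2)] += 1
        trans_totals[t1] += 1

alpha, beta = 0.001, 0.001
N, V = len(tags), len(words)

# A[i, j] = P(t_j | t_i) as in (4.1); B[i, k] = P(w_k | t_i) as in (4.2).
A = np.array([[(trans_counts[(ti, tj)] + alpha) / (trans_totals[ti] + alpha * N)
               for tj in tags] for ti in tags])
B = np.array([[(emit_counts[(ti, w)] + beta) / (tag_totals[ti] + beta * V)
               for w in words] for ti in tags])

print(np.round(A, 3))  # each row sums to 1
print(np.round(B, 3))  # each row sums to 1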

4.1.3 Viterbi algorithm

Appendices – Python cheatsheets

5.1 cheatsheet

