Session 1 2024 - 2025 - Natural Language Processing
1. Introduction to Natural Language Processing
2. How did you know about NLP: school, online courses, YouTube videos,
internships, part-time jobs … ?
I’m interested in learning about your NLP background.
3. Word Representation
3.1. One Hot Encoding
3.2. TF-IDF
3.3. Word2Vec
00 Introduction to NLP
What is Natural Language Processing?
● Word-sense disambiguation (WSD): a word can have different meanings in different contexts
“apple” (the fruit) vs. “Apple” (the company)
● Named Entity Recognition (NER): extract entities such as persons, organizations, and locations from sentences
● Question Answering (QA): “متى نشأت ويكيبيديا؟” (“When was Wikipedia founded?”) => “نشأت ويكيبيديا في عام 2001، حيث نمت وتطورت بسرعة لتصبح واحدة من أكبر المواقع على الإنترنت” (“Wikipedia was founded in 2001, where it quickly grew and developed to become one of the largest sites on the Internet”)
● Machine Translation (MT): “كتب ابن سينا عن الفلسفة الإسلامية المبكرة، لا سيما في موضوعات المنطق والأخلاق والميتافيزيقيا” => “Avicenna wrote about early Islamic philosophy, especially in the subjects of logic, ethics, and metaphysics”
Some Applications of Natural Language Processing
● Medical Field
● Security Field
● Transportation ...
Smart Replies
Smart Reply in Inbox by Gmail
Language Translation
Sentiment Analysis
Spam Detection
----------------------------------------------------------------------------------------------
01 Word Tokenization
Key words
❖Corpus: A collection of text documents used for linguistic analysis or training machine learning
models.
❖Document: An individual piece or unit of text within a corpus, such as an article, book, or any
other text source.
❖Token: A unit of text obtained through tokenization, often a word or subword, representing the
basic building blocks of a language.
❖N-grams: Contiguous sequences of N items (usually words) from a given sample of text or
speech. N-grams are used in various natural language processing tasks to capture contextual
information.
Key words
[
“I am an ENIT student”,
“We are studying NLP”,
“Let’s answer this question”
]
Define the
- Corpus
- Documents
- Token
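A minimal sketch (plain Python, assuming simple whitespace tokenization with str.split()) of how these three key words map onto the toy corpus above:

# The corpus is the whole collection; each string inside it is one document.
corpus = [
    "I am an ENIT student",
    "We are studying NLP",
    "Let’s answer this question",
]

for i, document in enumerate(corpus):
    # Tokens: the units obtained by splitting each document on whitespace.
    tokens = document.split()
    print(f"Document {i}: {tokens}")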
Word Tokenization
● Tokens are generally defined as words, punctuation marks, and numbers. But we can easily extend
their definition to any other units of meaning contained in a sequence of characters, like ASCII emoticons,
Unicode emojis, mathematical symbols, and so on...
Word Tokenization: N-grams
● An n-gram is a sequence containing up to n elements that have been extracted from a sequence of
those elements.
● Extending our concept of a token to include multi-word tokens will help us retain much of the meaning
inherent in the order of words in our statements.
● Retrieving tokens from a document will require some string manipulation beyond just the str.split()
method (see the sketch after this list). You’ll have to think about:
❖ Prefixes and suffixes: “re,” “pre,” and “ing” have intrinsic meaning.
❖ Compound words: Is “ice cream” one word, or two words, “ice” and “cream”?
❖ Invisible words: The single statement “Don’t!” means “Don’t you do that!” or “You, do not do
that!”
❖ Words with multiple meanings: word interpretation, e.g. “apple” the fruit or “Apple” the brand
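A minimal sketch of these ideas; the regular expression below is one illustrative choice, not the course's reference tokenizer:

import re

sentence = "Don’t forget: is “ice cream” one token or two?"

# Naive whitespace tokenization keeps punctuation glued to the words.
print(sentence.split())

# A slightly smarter pattern: keep word-internal apostrophes (“Don’t”),
# and split other punctuation into separate tokens.
tokens = re.findall(r"\w+(?:[’']\w+)?|[^\w\s]", sentence)
print(tokens)

# 2-grams: pairs of adjacent tokens, which preserve some word-order information.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)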
Tokenization: Case Normalization
● With case normalization, we attempt to return tokens to their “normal” state, before
grammar rules and their position in a sentence affected their capitalization, by lowercasing them.
●Normalizing word and character capitalization is one way to reduce your vocabulary size. It
helps you consolidate words that are intended to mean the same thing under a single token.
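A minimal sketch of case normalization, simply lowercasing every token:

tokens = ["A", "Token", "is", "NOT", "the", "same", "as", "a", "token"]

# Lowercasing folds "A"/"a" and "Token"/"token" into single vocabulary entries.
normalized = [token.lower() for token in tokens]

print(normalized)
print("vocabulary size before:", len(set(tokens)))      # 9
print("vocabulary size after: ", len(set(normalized)))  # 7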
Tokenization: Stop Words
● Stop words are common words in any language that occur with a high frequency but carry much
less substantive information about the meaning of a phrase. Examples of some common stop
words include:
❖a, an
❖the, this
❖and, or
● Stop words are often excluded from NLP pipelines to reduce the computational effort of
extracting information from a text, without significantly affecting its meaning.
https://www.ranks.nl/stopwords
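A minimal sketch of stop-word filtering, using a tiny hand-written stop-word list; in practice you would take a fuller list, e.g. from the link above or from a library such as NLTK:

stop_words = {"a", "an", "the", "this", "and", "or", "is", "of"}

tokens = ["the", "house", "on", "the", "hill", "is", "green"]

# Keep only the tokens that are not stop words.
filtered = [token for token in tokens if token not in stop_words]

print(filtered)  # ['house', 'on', 'hill', 'green']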
Tokenization: Stemming
● We want to eliminate the small meaning differences of pluralization or possessive endings of words, or
even various verb forms. For example:
❖“house”, “houses” and “housing” share the same stem, “hous”
❖“developer”, “development” and “developing” share the same stem, “develop”
● Stemming reduces the size of your vocabulary while limiting the loss of information and meaning.
● It helps generalize your language model, enabling the model to behave identically for all the words
that share a stem.
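A minimal sketch using NLTK's Porter stemmer (assuming the nltk package is installed; the exact stems depend on which stemming algorithm you pick):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["house", "houses", "housing", "developer", "development", "developing"]

# Rule-based suffix stripping; the resulting stem need not be a real word.
print([stemmer.stem(word) for word in words])
# e.g. ['hous', 'hous', 'hous', 'develop', 'develop', 'develop']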
Tokenization: Lemmatization
● Going down to the semantic root of a word (its lemma) is called lemmatization. For example:
❖ “better”, POS=adjective, has the lemma “good”
★ You must tell your lemmatizer which part of speech you are
interested in, if you want to find the most accurate lemma
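A minimal sketch with NLTK's WordNet lemmatizer (assuming nltk and its WordNet data are installed, e.g. via nltk.download("wordnet")); note how the POS tag changes the result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the lemmatizer assumes a noun, so "better" is unchanged.
print(lemmatizer.lemmatize("better"))           # better
# Telling it "better" is an adjective ("a") yields the semantic root.
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("houses", pos="n"))  # house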
Word Tokenization: Questions!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ A1: There are 12 tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “a” , “term” , “or” ,
“a” , “word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter
▪ A2: There are 11 unique tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “term” , “or”
, “a” , “word!”
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
▪ A3: There are 11 2-grams: (A, token), (token, is), (is, often), (often, referred), (referred, to), (to, as), (as,
a), (a, term), (term, or), (or, a), (a, word!)
Word Tokenization: Questions!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter and considering case normalization
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ A1: There are 12 tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “a” , “term” , “or” ,
“a” , “word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter and considering case normalization
▪ A2: There are 10 unique tokens: “a” , “token” , “is” , “often” , “referred” , “to” , “as” , “term” , “or”
, “word!”
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
▪ A3: There are 11 2-grams: (a, token), (token, is), (is, often), (often, referred), (referred, to), (to, as), (as,
a), (a, term), (term, or), (or, a), (a, word!)
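A minimal sketch that reproduces these counts with plain Python and whitespace splitting only:

sentence = "A token is often referred to as a term or a word!"

tokens = sentence.split()                       # whitespace tokenization
vocab = set(tokens)                             # unique tokens, case-sensitive
vocab_normalized = {t.lower() for t in tokens}  # unique tokens after case normalization
bigrams = list(zip(tokens, tokens[1:]))         # 2-grams of adjacent tokens

print(len(tokens))            # 12
print(len(vocab))             # 11
print(len(vocab_normalized))  # 10
print(len(bigrams))           # 11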
----------------------------------------------------------------------------------------------
02 Word Representation
NLP Pipeline
Word Representation: One Hot Encoding
One way to represent the tokens is to transform them into a sequence/table of numbers. In this
representation:
❖Each row of the table is a binary row vector representing a single word.
❖Each row vector contains a lot of zeros “0” and a one “1”.
❖A one “1” means on, or hot. A zero “0” means off, or absent.
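A minimal sketch of one-hot encoding with plain Python lists (the column order of the vocabulary is arbitrary here):

sentence = "We are studying NLP"
tokens = sentence.split()

# Vocabulary: one column per unique token.
vocab = sorted(set(tokens))

# One binary row vector per token: a single 1 in that token's column, 0 elsewhere.
for token in tokens:
    row = [1 if token == v else 0 for v in vocab]
    print(f"{token:10s} {row}")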
Word Representation: One Hot Encoding
● This solves the first problem of NLP: turning a sentence of natural language words into a sequence
of numbers or vectors that a computer can “understand.”
● We haven’t lost any words; all the information was retained.
● However, most of the entries are zero, even for large documents with a verbose vocabulary.
● For a long document this might not be practical: the document size (the length of the
vector table) would grow huge.
● We haven’t lost any words, true, but we have lost meaning!
Word Representation: Word2Vec
● In the previous word representation, we ignored:
❖The nearby context of a word.
❖The words around each word.
❖The effect the neighbors of a word have on its meaning
● Word vectors are numerical vector representations of word semantics, or meaning. So word vectors can
capture the connotation of words, like “peopleness,” “animalness,” “placeness,” “thingness,” and even
“conceptness.”
Word Representation: Word2Vec
● The network consists of two layers of weights, where the hidden layer consists of n neurons; n is the number of vector
dimensions used to represent a word. Both the input and output layers contain M neurons, where M is the number of
words in the model’s vocabulary. The output layer activation function is a softmax, which is commonly used for
classification problems.
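A minimal numpy sketch of the two weight layers described above, just to make the shapes concrete (M = vocabulary size, n = vector dimensions; the weights are random here, not trained):

import numpy as np

M, n = 6, 3                      # vocabulary size and embedding dimension
W_in = np.random.rand(M, n)      # input layer -> hidden layer weights
W_out = np.random.rand(n, M)     # hidden layer -> output layer weights

x = np.zeros(M)                  # one-hot input vector for the word with index 2
x[2] = 1.0

hidden = x @ W_in                # picks out row 2 of W_in: that word's n-dimensional vector
scores = hidden @ W_out          # one raw score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

print(hidden)                    # the word vector (what we keep after training)
print(probs.round(3))            # predicted probability for each word in the vocabulary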
● Word2vec is unsupervised by nature (there is no need for labeled, categorized, or structured text data).
● There are two possible ways to train Word2vec embeddings:
❖The skip-gram
❖The continuous bag-of-words
Word Representation: Vectors calculation - skip-gram
★ The skip-gram approach predicts the context of words (output words) from a word of interest (the input word).
In this example, the input word is “Monet”, and the expected output of the network is either “Claude” or “painted”
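A minimal sketch of how skip-gram training pairs could be generated, assuming the example sentence is “Claude Monet painted the Grand Canal” and a context window of 2 words on each side (real Word2vec training adds subsampling and negative sampling on top of this):

sentence = "Claude Monet painted the Grand Canal".split()
window = 2

# Skip-gram pairs: (input word, context word); the network is trained to
# predict each context word from the input word.
pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

# For the input word "Monet" the expected outputs include "Claude" and "painted".
print([pair for pair in pairs if pair[0] == "Monet"])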
Word Representation: Vectors calculation - continuous bag-of-words
★ The continuous bag-of-words approach predicts the target word (the output word) from the nearby words (input
words).
In this example, we create a multi-hot vector of all the surrounding terms “Claude”, “Monet”, “the”, “Grand” as an input vector
to predict the output token “painted”.
Word Representation: the gensim.models.word2vec module
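A minimal sketch of training word vectors with gensim's Word2Vec class (assuming gensim is installed; the corpus and parameter values are illustrative only, not the lecture's settings):

from gensim.models import Word2Vec

# Toy corpus: each document is a list of already tokenized, lowercased tokens.
corpus = [
    ["claude", "monet", "painted", "the", "grand", "canal"],
    ["monet", "painted", "water", "lilies"],
    ["the", "grand", "canal", "is", "in", "venice"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # n: dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every token (toy corpus, so no frequency cutoff)
    sg=1,             # 1 = skip-gram, 0 = continuous bag-of-words
)

print(model.wv["monet"][:5])           # first few dimensions of the vector for "monet"
print(model.wv.most_similar("monet"))  # nearest neighbours in the vector space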