Session1 2024 - 2025 - Natural Language Processing

The document provides an overview of Natural Language Processing (NLP), defining it as a subfield of AI focused on human-computer language interactions and outlining various NLP tasks such as word-sense disambiguation, named entity recognition, and machine translation. It discusses key concepts like tokenization, word representation methods including one-hot encoding, TF-IDF, and Word2Vec, and highlights the challenges and applications of NLP in fields like medicine and digital marketing. Additionally, it covers techniques for processing text data, such as stemming, lemmatization, and the importance of stop words in NLP pipelines.

Natural Language Processing

Mustapha Ben Haj Miniaoui


AI Research Engineer
I’d like to know about your NLP background

1. How would you define yourself in NLP: beginner, intermediate or expert?

2. How did you learn about NLP: school, online courses, YouTube videos,
internships, part-time jobs … ?
I’d like to know about your NLP background

1. Which NLP applications come to your mind?

2. Which NLP models have you read about, used, or trained… ?

3. Which programming language do you feel comfortable using?


Contents
1. Introduction to NLP

2. Word Tokenization
2.1 Tokens
2.2 Tokenizers

3. Word Representation
3.1 One Hot Encoding
3.2 TF-IDF
3.3 Word2Vec

4. Recurrent Neural Networks (RNNs)

5. LSTM
----------------------------------------------------------------------------------------------

00 Introduction to NLP
What is Natural Language Processing?

A subfield of Artificial Intelligence concerned with the interactions between computers and human natural languages.

Challenges in natural language processing involve the ability to:

● Read/hear,
● Understand,
● Derive meaning from human languages.
Natural Language Processing tasks

● Word-sense disambiguation (WSD): a word can have different meanings in different contexts

“Apple” (the fruit) vs. “Apple” (the brand)

● Named Entity Recognition (NER): extract entities like person, organization, location, ... from sentences

“Ibnou Khaldoun” is from “Tunisia”.

● Part-of-Speech tagging (PoS): identifying whether a word is a noun, a verb, ...

20 March 1956 “marks” VERB the “declaration” NOUN of Tunisia’s independence.


Natural Language Processing tasks

● Language Generation: predict the next word or sentence

“The next Tunisian president will be” => “ … ”

● Question Answering (QA)

Arabic example, translated: (“When was Wikipedia founded?”, “Wikipedia was founded in 2001, where it grew and developed rapidly to become one of the largest sites on the Internet”)

=> “Wikipedia was founded in 2001”

● Machine Translation (MT)

Arabic source sentence (literally: “Ibn Sina wrote about early Islamic philosophy, especially in the subjects of logic, ethics, and metaphysics”)

=> “Avicenna wrote about early Islamic philosophy, especially in the subjects of logic, ethics, and metaphysics”
Some Applications of Natural Language Processing

● Medical Field

● Online Advertising and Digital Marketing

● Security Field

● Transportation ...
Smart Replies

Smart reply
in Inbox by Gmail

Language Translation
Sentiment Analysis

Spam Detection

----------------------------------------------------------------------------------------------

01 Word Tokenization
Key words
❖ Corpus: A collection of text documents used for linguistic analysis or training machine learning models.

❖ Document: An individual piece or unit of text within a corpus, such as an article, book, or any other text source.

❖ Token: A unit of text obtained through tokenization, often a word or subword, representing the basic building blocks of a language.

❖ N-grams: Contiguous sequences of N items (usually words) from a given sample of text or speech. N-grams are used in various natural language processing tasks to capture contextual information.
Key words

[
“I am an ENIT student”,
“We are studying NLP”,
“Let’s answer this question”
]

Define the
- Corpus
- Documents
- Token
Word Tokenization

● Tokenization is a particular kind of document segmentation. Segmentation can include breaking a document into paragraphs, paragraphs into sentences, sentences into phrases, or phrases into tokens; this last step is called tokenization.

★ The simplest way to tokenize a sentence is to use whitespace within a string as the “delimiter” of tokens.
★ The collection of documents is called a “corpus”.
★ The set of all unique tokens is called the “vocabulary”.
★ The number of unique tokens is your “vocabulary size”.
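A minimal sketch of whitespace tokenization in Python, using the sentence from the exercise later in this section (the variable names are illustrative):

# The simplest tokenizer: split the document on whitespace.
document = "A token is often referred to as a term or a word!"
tokens = document.split()                 # 12 tokens; note "A" and "a" stay distinct
vocabulary = set(tokens)                  # the set of unique tokens
print(len(tokens))                        # 12
print(len(vocabulary))                    # 11 -> the "vocabulary size"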
Word Tokenization

● Tokens are generally defined by words, punctuation marks, and numbers. But we can easily extend their definition to any other units of meaning contained in a sequence of characters, like ASCII emoticons, Unicode emojis, mathematical symbols, and so on.
Word Tokenization: N-grams

● An n-gram is a sequence containing up to n elements that have been extracted from a sequence of
those elements.

● Extending our concept of a token to include multi-word tokens will help us retain much of the meaning inherent in the order of words in your statements.

★ The NLTK library is here to help us extract n-grams easily.
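A small sketch of bigram extraction with NLTK's ngrams helper (assuming NLTK is installed; the sentence is again the one from the exercise below):

from nltk.util import ngrams

tokens = "A token is often referred to as a term or a word!".split()
bigrams = list(ngrams(tokens, 2))         # contiguous pairs of adjacent tokens
print(bigrams[0])                         # ('A', 'token')
print(len(bigrams))                       # 11 bigrams from 12 tokens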
Word Tokenization: Tokens

★ In this session we will focus on the English language

● Retrieving tokens from a document will require some string manipulation beyond just the str.split() method. You’ll have to think about:

❖ Prefixes and suffixes: “re,” “pre,” and “ing” have intrinsic meaning.
❖ Compound words: Is “ice cream” one word or two words, “ice” and “cream”?
❖ Invisible words: The single statement “Don’t!” means “Don’t you do that!” or “You, do not do that!”
❖ Words with multiple meanings: “apple” the fruit or “Apple” the brand
Tokenization: Case Normalization

● With case normalization, we are attempting to return tokens to their “normal” state, before grammar rules and their position in a sentence changed their capitalization, by lowercasing them.

● Normalizing word and character capitalization is one way to reduce your vocabulary size. It helps you consolidate words that are intended to mean the same thing under a single token.

★ Undoing this capitalization “denormalization” is called “case normalization” or, more commonly, “case folding”.
★ Case normalization is useless for languages that don’t have a concept of capitalization!
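A tiny sketch of case folding with Python's built-in str.lower():

tokens = "A token is often referred to as a term or a word!".split()
folded = [token.lower() for token in tokens]   # "A" and "a" collapse into one token
print(len(set(tokens)))                        # 11 unique tokens before folding
print(len(set(folded)))                        # 10 unique tokens after folding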
Word Tokenization: Stop Words

● Stop words are common words in any language that occur with a high frequency but carry much less substantive information about the meaning of a phrase. Examples of some common stop words include:
❖ a, an
❖ the, this
❖ and, or

● Stop words are often excluded from NLP pipelines in order to reduce the computational effort of extracting information from a text, without significantly affecting its meaning.
https://www.ranks.nl/stopwords

★ The NLTK library contains a list of pre-defined English stop words that can be used directly.
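A minimal sketch using NLTK's pre-defined English stop-word list (it requires a one-time nltk.download('stopwords'); the example sentence is illustrative):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                     # one-time corpus download
stop_words = set(stopwords.words("english"))

tokens = "the cat sat on the mat".split()
content = [t for t in tokens if t not in stop_words]
print(content)                                 # ['cat', 'sat', 'mat']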
Word Tokenization: Stemming

● We want to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb forms. For example:
❖ “house”, “houses” and “housing” share the same stem, “hous”
❖ “developer”, “development” and “developing” share the same stem, “develop”

● Stemming reduces the size of your vocabulary while limiting the loss of information and meaning.
● It helps generalize your language model, enabling the model to behave identically for all the words that share a stem.

★ A stem isn’t required to be a properly spelled word!
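A short sketch with NLTK's PorterStemmer, reproducing the “hous” example above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["house", "houses", "housing"]:
    print(stemmer.stem(word))                  # all three stem to 'hous'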
Word Tokenization: Lemmatization

● Going down to the semantic root of a word (its lemma) is called lemmatization. For example:
❖ “better”, with POS=adjective, has the lemma “good”

● Lemmatization reduces the dimensionality of your language model.

● It takes into account a word’s meaning.

★ You must tell your lemmatizer which part of speech you are interested in if you want to find the most accurate lemma.
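A minimal sketch with NLTK's WordNetLemmatizer, reproducing the “better” => “good” example (it requires a one-time nltk.download('wordnet')):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")                        # one-time corpus download
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (pos="a" means adjective)
print(lemmatizer.lemmatize("better"))           # 'better' (the default POS is noun)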
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”

▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter

▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ A1: There are 12 tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “a” , “term” , “or” ,
“a” , “word!”

▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter
▪ A2: There are 11 unique tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “term” , “or”
, “a” , “word!”

▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
▪ A3: There are 11 2-grams: (A, token), (token, is), (is, often), (often, referred), (referred, to), (to, as), (as, a), (a, term), (term, or), (or, a), (a, word!)
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”

▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter and considering case normalization

▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ A1: There are 12 tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “a” , “term” , “or” ,
“a” , “word!”

▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter and considering case normalization
▪ A2: There are 10 unique tokens: “a” , “token” , “is” , “often” , “referred” , “to” , “as” , “term” , “or”
, “word!”

▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
▪ A3: There are 11 2-grams: (a, token), (token, is), (is, often), (often, referred), (referred, to), (to, as), (as, a), (a, term), (term, or), (or, a), (a, word!)
----------------------------------------------------------------------------------------------

02 Word Representation
NLP Pipeline

Train Neural Networks to Represent Words and Understand Their Meaning => Use Trained Neural Network to Predict an Output

[Figure: an example word encoded as a binary vector of 0s and 1s]
Word Representation: One Hot Encoding

One way to represent tokens is to transform them into a sequence/table of numbers. In this representation:

❖ Each row of the table is a binary row vector representing a single word.
❖ Each row vector contains a lot of zeros “0” and a single one “1”.
❖ A one “1” means on, or hot. A zero “0” means off, or absent.
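A minimal sketch that builds these one-hot row vectors by hand (the sentence comes from the earlier exercise corpus; the variable names are illustrative):

# One row per token, one column per vocabulary word.
tokens = "We are studying NLP".split()
vocabulary = sorted(set(tokens))               # fix an order for the columns
index = {word: i for i, word in enumerate(vocabulary)}

onehot = []
for token in tokens:
    row = [0] * len(vocabulary)                # all zeros ("off") ...
    row[index[token]] = 1                      # ... except a single "hot" position
    onehot.append(row)

for token, row in zip(tokens, onehot):
    print(token, row)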
Word Representation: One Hot Encoding

● This solves the first problem of NLP: turning a sentence of natural-language words into a sequence of numbers or vectors that a computer can “understand.”
● We haven’t lost any words; all information was retained.

● Most of our counts are zero, even for large documents with a verbose vocabulary.
● For a long document this might not be practical. Your document size (the length of the vector table) would grow to be huge.
● We haven’t lost any words, true, but we have lost meaning!

➔ We retained the order of words, but expanded the dimensionality of our NLP problem.
➔ What we really want to do is compress the meaning of a document down to its essence.
➔ We just want to capture most of the meaning (information) in a document, not all of it!
Word Representation: TF-IDF

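The slide presents TF-IDF as a figure. As a quick reference: TF-IDF weighs a term by its frequency in a document (TF), discounted by how common the term is across the whole corpus (IDF), so words that appear everywhere get low weights. A minimal sketch with scikit-learn's TfidfVectorizer (an assumed library choice, not prescribed by the slides), reusing the corpus from the earlier exercise:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I am an ENIT student",
    "We are studying NLP",
    "Let's answer this question",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)       # one row of weights per document
print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(tfidf.toarray())                         # TF-IDF weight of each term per document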
Word Representation: Word2Vec
● In the previous word representations, we ignored:
❖ The nearby context of a word.
❖ The words around each word.
❖ The effect the neighbors of a word have on its meaning.

● Word vectors are numerical vector representations of word semantics, or meaning. So word vectors can capture the connotation of words, like “peopleness,” “animalness,” “placeness,” “thingness,” and even “conceptness.”
Word Representation: Word2Vec
● The network consists of two layers of weights, where the hidden layer consists of n neurons; n is the number of vector
dimensions used to represent a word. Both the input and output layers contain M neurons, where M is the number of
words in the model’s vocabulary. The output layer activation function is a softmax, which is commonly used for
classification problems.


● Word2vec has an unsupervised nature (no need for labeled, categorized, structured text data).
● There are two possible ways to train Word2vec embeddings:
❖The skip-gram
❖The continuous bag-of-words
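A minimal training sketch with gensim's Word2Vec class (assuming gensim 4.x; the toy corpus is illustrative):

from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per document.
sentences = [
    ["claude", "monet", "painted", "the", "grand", "canal"],
    ["monet", "painted", "water", "lilies"],
]

# sg=1 selects skip-gram; sg=0 (the default) selects continuous bag-of-words.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["monet"].shape)                 # (50,) -> one 50-dimensional vector
print(model.wv.most_similar("monet", topn=2))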
Word Representation: Vectors calculation - skip-gram

★ The skip-gram approach predicts the context words (the output words) from a word of interest (the input word).

In this example, the input word is “Monet”, and the expected output of the network is either “Claude” or “painted”.

Claude Monet painted the Grand Canal of Venice in 1908.
Word Representation: Vectors calculation - continuous bag-of-words

★ The continuous bag-of-words approach predicts the target word (the output word) from the nearby words (the input words).

In this example, we create a multi-hot vector of all the surrounding terms “Claude”, “Monet”, “the”, “Grand” as an input vector to predict the output token “painted”.

Claude Monet painted the Grand Canal of Venice in 1908.
Word Representation: gensim.word2vec module

★ Luckily, for most applications, you won’t need to compute your own word vectors.
★ Pretrained word vector representations are available for use.
★ Google provides a pretrained Word2vec model based on English Google News articles.
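A short sketch that loads this pretrained model through gensim's downloader API (note that the first call downloads roughly 1.6 GB):

import gensim.downloader as api

# 300-dimensional vectors trained on the Google News corpus.
wv = api.load("word2vec-google-news-300")

print(wv["computer"].shape)                    # (300,)
print(wv.most_similar("Tunisia", topn=3))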
Word2Vec: Limitations

▪ Inability to capture word order

▪ Out-of-vocabulary words

▪ Lack of Polysemy Handling
