Session 1 2024 - 2025 - Natural Language Processing
1. Introduction to Natural Language Processing
2. How did you know about NLP: school, online courses, YouTube videos,
internships, part-time jobs … ?
I’m interested in learning about your NLP background.
3. Word Representation
3.1. One Hot Encoding
3.2. TF-IDF
3.3. Word2Vec
00 Introduction to NLP
What is Natural Language Processing?
● Word-sense disambiguation (WSD): a word can have different meanings in different contexts
“apple” (the fruit) vs. “Apple” (the company)
● Named Entity Recognition (NER): extract entities such as persons, organizations, and locations from sentences
● Question Answering (QA): “متى نشأت ويكيبيديا؟” (“When was Wikipedia founded?”) => “نشأت ويكيبيديا في عام 2001، حيث نمت وتطورت بسرعة لتصبح واحدة من أكبر المواقع على الإنترنت” (“Wikipedia was founded in 2001, where it quickly grew and developed to become one of the largest sites on the Internet”)
● Machine Translation (MT): “كتب ابن سينا عن الفلسفة الإسلامية المبكرة، لا سيما في موضوعات المنطق والأخلاق والميتافيزيقيا” => “Avicenna wrote about early Islamic philosophy, especially in the subjects of logic, ethics, and metaphysics”
Some Applications of Natural Language Processing
● Medical Field
● Security Field
● Transportation ...
Smart Replies
Smart Reply in Inbox by Gmail
Language Translation
Sentiment Analysis
Spam Detection
----------------------------------------------------------------------------------------------
01 Word Tokenization
Key words
❖Corpus: A collection of text documents used for linguistic analysis or training machine learning
models.
❖Document: An individual piece or unit of text within a corpus, such as an article, book, or any
other text source.
❖Token: A unit of text obtained through tokenization, often a word or subword, representing the
basic building blocks of a language.
❖N-grams: Contiguous sequences of N items (usually words) from a given sample of text or
speech. N-grams are used in various natural language processing tasks to capture contextual
information.
Key words
[
“I am an ENIT student”,
“We are studying NLP”,
“Let’s answer this question”
]
Define the
- Corpus
- Documents
- Token
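A minimal sketch (plain Python, assuming simple whitespace tokenization with str.split()) of how these three key words map onto the toy corpus above:

# The corpus is the whole collection; each string inside it is one document.
corpus = [
    "I am an ENIT student",
    "We are studying NLP",
    "Let’s answer this question",
]

for i, document in enumerate(corpus):
    # Tokens: the units obtained by splitting each document on whitespace.
    tokens = document.split()
    print(f"Document {i}: {tokens}")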
Word Tokenization
● Tokens are generally defined as words, punctuation marks, and numbers. But we can easily extend
their definition to any other units of meaning contained in a sequence of characters, like ASCII emoticons,
Unicode emojis, mathematical symbols, and so on...
Word Tokenization: N-grams
● An n-gram is a sequence containing up to n elements that have been extracted from a sequence of
those elements.
● Extending our concept of a token to include multi-word tokens will help us retain much of the meaning
inherent in the order of words in our statements.
● Retrieving tokens from a document will require some string manipulation beyond just the str.split()
method (see the sketch after this list). You’ll have to think about:
❖ Prefixes and suffixes: “re,” “pre,” and “ing” have intrinsic meaning.
❖ Compound words: Is “ice cream” one word, or two words, “ice” and “cream”?
❖ Invisible words: The single statement “Don’t!” means “Don’t you do that!” or “You, do not do
that!”
❖ Words with multiple meanings: word interpretation, e.g. “apple” the fruit or “Apple” the brand
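A minimal sketch of these ideas; the regular expression below is one illustrative choice, not the course's reference tokenizer:

import re

sentence = "Don’t forget: is “ice cream” one token or two?"

# Naive whitespace tokenization keeps punctuation glued to the words.
print(sentence.split())

# A slightly smarter pattern: keep word-internal apostrophes (“Don’t”),
# and split other punctuation into separate tokens.
tokens = re.findall(r"\w+(?:[’']\w+)?|[^\w\s]", sentence)
print(tokens)

# 2-grams: pairs of adjacent tokens, which preserve some word-order information.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)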
Tokenization: Case Normalization
● With case normalization, we attempt to return tokens to their “normal” state, before
grammar rules and their position in a sentence affected their capitalization, by lowercasing them.
●Normalizing word and character capitalization is one way to reduce your vocabulary size. It
helps you consolidate words that are intended to mean the same thing under a single token.
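A minimal sketch of case normalization, simply lowercasing every token:

tokens = ["A", "Token", "is", "NOT", "the", "same", "as", "a", "token"]

# Lowercasing folds "A"/"a" and "Token"/"token" into single vocabulary entries.
normalized = [token.lower() for token in tokens]

print(normalized)
print("vocabulary size before:", len(set(tokens)))      # 9
print("vocabulary size after: ", len(set(normalized)))  # 7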
Tokenization: Stop Words
● Stop words are common words in any language that occur with a high frequency but carry much
less substantive information about the meaning of a phrase. Examples of some common stop
words include:
❖a, an
❖the, this
❖and, or
● Stop words are often excluded from NLP pipelines to reduce the computational effort of
extracting information from a text, without significantly affecting its meaning.
https://www.ranks.nl/stopwords
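A minimal sketch of stop-word filtering, using a tiny hand-written stop-word list; in practice you would take a fuller list, e.g. from the link above or from a library such as NLTK:

stop_words = {"a", "an", "the", "this", "and", "or", "is", "of"}

tokens = ["the", "house", "on", "the", "hill", "is", "green"]

# Keep only the tokens that are not stop words.
filtered = [token for token in tokens if token not in stop_words]

print(filtered)  # ['house', 'on', 'hill', 'green']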
Tokenization: Stemming
● We want to eliminate the small meaning differences of pluralization or possessive endings of words, or
even various verb forms. For example:
❖“house”, “houses” and “housing” share the same stem, “hous”
❖“developer”, “development” and “developing” share the same stem, “develop”
● Stemming reduces the size of your vocabulary while limiting the loss of information and meaning.
● It helps generalize your language model, enabling the model to behave identically for all the words
that share a stem.
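A minimal sketch using NLTK's Porter stemmer (assuming the nltk package is installed; the exact stems depend on which stemming algorithm you pick):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["house", "houses", "housing", "developer", "development", "developing"]

# Rule-based suffix stripping; the resulting stem need not be a real word.
print([stemmer.stem(word) for word in words])
# e.g. ['hous', 'hous', 'hous', 'develop', 'develop', 'develop']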
Tokenization: Lemmatization
● Going down to the semantic root of a word (its lemma) is called lemmatization. For example:
❖ “better”, POS=adjective, has the lemma “good”
★ You must tell your lemmatizer which part of speech you are
interested in, if you want to find the most accurate lemma
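A minimal sketch with NLTK's WordNet lemmatizer (assuming nltk and its WordNet data are installed, e.g. via nltk.download("wordnet")); note how the POS tag changes the result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the lemmatizer assumes a noun, so "better" is unchanged.
print(lemmatizer.lemmatize("better"))           # better
# Telling it "better" is an adjective ("a") yields the semantic root.
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("houses", pos="n"))  # house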
Word Tokenization: Questions!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ A1: There are 12 tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “a” , “term” , “or” ,
“a” , “word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter
▪ A2: There are 11 unique tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “term” , “or”
, “a” , “word!”
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
▪ A3: There are 11 2-grams: (A, token), (token, is), (is, often), (often, referred), (referred, to), (to, as), (as,
a), (a, term), (term, or), (or, a), (a, word!)
Word Tokenization: Questions!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter and considering case normalization
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
Word Tokenization: Answers!
▪ Q1: How many tokens are in the following sentence, using only a
whitespace delimiter: “A token is often referred to as a term or a word!”
▪ A1: There are 12 tokens: “A” , “token” , “is” , “often” , “referred” , “to” , “as” , “a” , “term” , “or” ,
“a” , “word!”
▪ Q2: What is the vocabulary size for the same sentence using the same
delimiter and considering case normalization
▪ A2: There are 10 unique tokens: “a” , “token” , “is” , “often” , “referred” , “to” , “as” , “term” , “or”
, “word!”
▪ Q3: How many 2-grams are in the same sentence using the same
delimiter:
▪ A3: There are 11 2-grams: (a, token), (token, is), (is, often), (often, referred), (referred, to), (to, as), (as,
a), (a, term), (term, or), (or, a), (a, word!)
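A minimal sketch that reproduces these counts with plain Python and whitespace splitting only:

sentence = "A token is often referred to as a term or a word!"

tokens = sentence.split()                       # whitespace tokenization
vocab = set(tokens)                             # unique tokens, case-sensitive
vocab_normalized = {t.lower() for t in tokens}  # unique tokens after case normalization
bigrams = list(zip(tokens, tokens[1:]))         # 2-grams of adjacent tokens

print(len(tokens))            # 12
print(len(vocab))             # 11
print(len(vocab_normalized))  # 10
print(len(bigrams))           # 11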
----------------------------------------------------------------------------------------------
02 Word Representation
NLP Pipeline
Word Representation: One Hot Encoding
One way to represent the tokens is to transform them into a sequence/table of numbers. In this
representation:
❖Each row of the table is a binary row vector representing a single word.
❖Each row vector contains a lot of zeros “0” and a one “1”.
❖A one “1” means on, or hot. A zero “0” means off, or absent.
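A minimal sketch of one-hot encoding with plain Python lists (the column order of the vocabulary is arbitrary here):

sentence = "We are studying NLP"
tokens = sentence.split()

# Vocabulary: one column per unique token.
vocab = sorted(set(tokens))

# One binary row vector per token: a single 1 in that token's column, 0 elsewhere.
for token in tokens:
    row = [1 if token == v else 0 for v in vocab]
    print(f"{token:10s} {row}")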
Word Representation: One Hot Encoding
● This solves the first problem of NLP: turning a sentence of natural language words into a sequence
of numbers or vectors that a computer can “understand.”
● We haven’t lost any words; all the information was retained.
● However, most of the entries are zero, even for large documents with a verbose vocabulary.
● For a long document this might not be practical: the document size (the length of the
vector table) would grow huge.
● We haven’t lost any words, true, but we have lost meaning!
Word Representation: Word2Vec
● In the previous word representation, we ignored:
❖The nearby context of a word.
❖The words around each word.
❖The effect the neighbors of a word have on its meaning
● Word vectors are numerical vector representations of word semantics, or meaning. So word vectors can
capture the connotation of words, like “peopleness,” “animalness,” “placeness,” “thingness,” and even
“conceptness.”
Word Representation: Word2Vec
● The network consists of two layers of weights, where the hidden layer consists of n neurons; n is the number of vector
dimensions used to represent a word. Both the input and output layers contain M neurons, where M is the number of
words in the model’s vocabulary. The output layer activation function is a softmax, which is commonly used for
classification problems.
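A minimal numpy sketch of the two weight layers described above, just to make the shapes concrete (M = vocabulary size, n = vector dimensions; the weights are random here, not trained):

import numpy as np

M, n = 6, 3                      # vocabulary size and embedding dimension
W_in = np.random.rand(M, n)      # input layer -> hidden layer weights
W_out = np.random.rand(n, M)     # hidden layer -> output layer weights

x = np.zeros(M)                  # one-hot input vector for the word with index 2
x[2] = 1.0

hidden = x @ W_in                # picks out row 2 of W_in: that word's n-dimensional vector
scores = hidden @ W_out          # one raw score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

print(hidden)                    # the word vector (what we keep after training)
print(probs.round(3))            # predicted probability for each word in the vocabulary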
● Word2vec is unsupervised by nature (there is no need for labeled, categorized, or structured text data).
● There are two possible ways to train Word2vec embeddings:
❖The skip-gram
❖The continuous bag-of-words
Word Representation: Vectors calculation - skip-gram
★ The skip-gram approach predicts the context of words (output words) from a word of interest (the input word).
In this example, the input word is “Monet”, and the expected output of the network is either “Claude” or “painted”
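A minimal sketch of how skip-gram training pairs could be generated, assuming the example sentence is “Claude Monet painted the Grand Canal” and a context window of 2 words on each side (real Word2vec training adds subsampling and negative sampling on top of this):

sentence = "Claude Monet painted the Grand Canal".split()
window = 2

# Skip-gram pairs: (input word, context word); the network is trained to
# predict each context word from the input word.
pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

# For the input word "Monet" the expected outputs include "Claude" and "painted".
print([pair for pair in pairs if pair[0] == "Monet"])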
Word Representation: Vectors calculation - continuous bag-of-words
★ The continuous bag-of-words approach predicts the target word (the output word) from the nearby words (input
words).
In this example, we create a multi-hot vector of all the surrounding terms “Claude”, “Monet”, “the”, “Grand” as an input vector
to predict the output token “painted”.
Word Representation: the gensim.models.word2vec module
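A minimal sketch of training word vectors with gensim's Word2Vec class (assuming gensim is installed; the corpus and parameter values are illustrative only, not the lecture's settings):

from gensim.models import Word2Vec

# Toy corpus: each document is a list of already tokenized, lowercased tokens.
corpus = [
    ["claude", "monet", "painted", "the", "grand", "canal"],
    ["monet", "painted", "water", "lilies"],
    ["the", "grand", "canal", "is", "in", "venice"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # n: dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every token (toy corpus, so no frequency cutoff)
    sg=1,             # 1 = skip-gram, 0 = continuous bag-of-words
)

print(model.wv["monet"][:5])           # first few dimensions of the vector for "monet"
print(model.wv.most_similar("monet"))  # nearest neighbours in the vector space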