
Artificial Intelligence for Natural Language Processing (NLP)

Part II – From Word to Numerical Analysis


Dr. Eng. Wael Ouarda
Assistant Professor, CRNS, Higher Education Ministry, Tunisia

Centre de Recherche en Numérique de Sfax, Route de Tunis km 10, Sakiet Ezzit, 3021 Sfax, Tunisia



1. Machine Learning algorithm for NLP

Example setup: 100 persons, 7 emotions. The pipeline is:

1. Data Scraping
2. Data Cleaning
3. Data Representation (Word Embedding)
4. Data Partitioning: the 100 persons are split into 85 for Train & Validation and 15 for Test; the 85 are then split again into 85 * 0.8 for Train and 85 * 0.2 for Validation, giving (X_train, Y_train), (X_Val, Y_Val) and (X_Test, Y_Test).
5. Machine Learning (Algorithm, Options): the algorithm is trained on (X_train, Y_train) to produce a Model.
6. Performance Evaluation: Y_Val' = Model.predict(X_Val) is compared against Y_Val, then Y_Test' = Model.predict(X_Test) against Y_Test.
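A minimal sketch of this split-train-evaluate loop, assuming scikit-learn; the feature matrix, labels and model choice below are placeholders, not the course's:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 50)            # placeholder features: 100 persons
y = np.random.randint(0, 7, size=100)  # placeholder labels: 7 emotions

# 85 persons for Train & Validation, 15 for Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=15, random_state=0)
# 85 * 0.8 for Train, 85 * 0.2 for Validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_val_pred = model.predict(X_val)      # Y_Val' = Model.predict(X_Val)
y_test_pred = model.predict(X_test)    # Y_Test' = Model.predict(X_Test)
print(accuracy_score(y_val, y_val_pred), accuracy_score(y_test, y_test_pred))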


2. Web Scraping Tools

• Open source Python libraries and frameworks for web scraping:
  • Textual content:
    • Newspaper3k: sends an HTTP request to the website's server to retrieve the data displayed on the target web page;
    • BeautifulSoup: a Python library designed to parse data, i.e., to extract data from HTML or XML documents;
    • Selenium: a web driver designed to render web pages as your web browser would, for the purpose of automated testing of web applications;
    • Scrapy: a complete web scraping framework designed explicitly for the job of scraping the web.
  • Visual content:
    • MechanicalSoup: a Python library designed to parse data, i.e., to extract URLs and hypertext from web pages.

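As a short illustration of the BeautifulSoup approach (a sketch only; the URL is a placeholder, not from the course):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text   # fetch the target page
soup = BeautifulSoup(html, "html.parser")         # parse the returned HTML
print(soup.title.get_text())                      # page title
for link in soup.find_all("a"):                   # every hyperlink on the page
    print(link.get("href"))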


3. Libraries & Frameworks

• Newspaper3k: scraping data;
• Facebook Scraper;
• Pandas: file I/O;
• Seaborn: statistics;
• NumPy: array use;
• NLTK: Natural Language Toolkit (dictionary (graph = WordNet), stopwords, punctuation, etc.);
• re: regular expressions.



4. Cleaning process

1. Tokenization: split the document into a list of words.
2. Lower casing: transform upper case into lower case.
3. Stop words removal: stop words are a list of words = ["When", "I", "How", ...]; the list can be modified by removing some words or adding other ones.
4. Special character removal: @ # ' " etc.
5. Punctuation removal: : , ; - ? ! etc.
6. Stemming: keep the base of the word: player, players, played, plays -> play.
7. Lemmatization: have and had will be considered as have; plays and played will be considered as play.
8. Spell check.
9. Translation.

A sketch of these steps with NLTK follows.
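This is a minimal sketch of steps 1-7, assuming NLTK with its punkt, stopwords and wordnet resources already downloaded; the input sentence is a placeholder:

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Hi? How are you, I am very happy to see you today!"
tokens = word_tokenize(text)                        # 1. tokenization
tokens = [t.lower() for t in tokens]                # 2. lower casing
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]       # 3. stop words removal
tokens = [t for t in tokens if t not in string.punctuation]   # 4-5. special characters and punctuation
stems = [PorterStemmer().stem(t) for t in tokens]             # 6. stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]   # 7. lemmatization
print(lemmas)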
4. Cleaning process: Regular Expression (re)

Examples: @ali, @ahmed, #, 'e', 'A12', 'A13', … cannot be removed using NLTK functions.
re processes the text shared on the web or on social media as a string.

• \d: Matches any decimal digit; this is equivalent to the class [0-9].
• \D: Matches any non-digit character; this is equivalent to the class [^0-9].
• \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
• \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
• \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
• \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
• Example: re.sub(r'[^@]', ' ', text) replaces every character that is not @ with a space, leaving only the @ symbols: '@ @ @ @'.
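A short sketch of this idea, stripping the mentions and codes from the examples above; the exact patterns are illustrative choices, not prescribed by the slides:

import re

tweet = "@ali and @ahmed posted #NLP results A12 A13"
tweet = re.sub(r"@\w+", "", tweet)          # drop @-mentions
tweet = re.sub(r"#\w+", "", tweet)          # drop hashtags
tweet = re.sub(r"\b[A-Z]\d+\b", "", tweet)  # drop codes such as A12, A13
print(re.sub(r"\s+", " ", tweet).strip())   # normalize leftover whitespace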
4. Cleaning process: Regular Expression (re)

Pattern   Description
^         Matches the beginning of a line (^ab means the string starts with ab).
$         Matches the end of a line (a$ means the string ends with a).
.         Matches any single character except newline; with the re.DOTALL flag it matches newline as well.
[...]     Matches any single character in the brackets.
[^...]    Matches any single character not in the brackets.


4. Cleaning process: worked example

Input: Hi? How are you, I am very content to see you today :)!

Tokenization:
[Hi, ?, How, are, you, ,, I, am, very, content, to, see, you, today, :, ), !]

Punctuation removal:
[Hi, How, are, you, I, am, very, content, to, see, you, today, )]

Special character removal:
[Hi, How, are, you, I, am, very, content, to, see, you, today]

Lower casing:
[hi, how, are, you, i, am, very, content, to, see, you, today]

Translation & spell check ("content" is French for "happy"):
[hi, how, are, you, i, am, very, happy, to, see, you, today]

Stop words removal:
[very, happy, see, today]

Lemmatization:
[very, happiness, see, today]
5. Sample of NLP Libraries for sentiment analysis

Sentiment is a tuple of (Polarity, Subjectivity):

• Polarity in [-1 (negative), 1 (positive)]: the orientation of the opinion behind the text;
• Subjectivity in [0, 1]: the weight of subjectivity of the text.

Pipeline: Data Collection -> Data Cleaning -> Data Representation -> Data Classification
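A minimal sketch with TextBlob, one library whose .sentiment returns exactly this (polarity, subjectivity) tuple; the slides do not name a specific library, so this choice is an assumption:

from textblob import TextBlob

blob = TextBlob("I am very happy to see you today")
print(blob.sentiment.polarity)      # orientation of the opinion, in [-1, 1]
print(blob.sentiment.subjectivity)  # weight of subjectivity, in [0, 1]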


6. Word Embedding Techniques (TF-IDF)

TF-IDF: Term Frequency – Inverse Document Frequency

Terminology
• t — term (word)
• d — document (set of words)
• N — number of documents in the corpus
• Corpus — the total document set

TF(t, d) = count of t in d / number of words in d
DF(t) = number of documents containing t
IDF(t) = log(N / DF(t))
TF-IDF(t, d) = TF(t, d) * log(N / (DF(t) + 1))   (the +1 smooths the denominator)

Example:

user   Tweet                                                 Label
Id1    Tweet11 = ["word111", "word112"] -> TF = [0.5, 0.5]   +
Id1    Tweet12                                               +
Id2    Tweet21                                               -
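The formulas translate directly into plain Python; this sketch assumes the natural logarithm, since the slides do not fix the log base, and uses placeholder documents:

import math

docs = [["word111", "word112"],
        ["word111", "word211", "word212"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)         # TF(t, d)

def df(term):
    return sum(1 for d in docs if term in d)  # DF(t)

def tf_idf(term, doc):
    return tf(term, doc) * math.log(N / (df(term) + 1))

print(tf_idf("word112", docs[0]))  # 0.5 * log(2/2) = 0.0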


6. Word Embedding Techniques (TF-IDF)

Activity. Recall:
TF(t, d) = count of t in d / number of words in d
DF(t) = number of documents containing t
TF-IDF(t, d) = TF(t, d) * log(N / (DF(t) + 1))

user   Tweet                                                   Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

Sample computations (with N = 7):
TF-IDF('bonjour', id1) = tf('bonjour', id1) * log(7/2) = 1 * log(7/2)
TF-IDF('ali', id1) = tf('ali', id1) * log(7/df('ali')) = 1 * log(7/3)
TF-IDF('ali', id2) = tf('ali', id2) * log(7/df('ali')) = 1 * log(7/3)
TF-IDF('ahmed', id1) = 2 * log(7/4)
TF-IDF('leaders') = 1 * log(7/3)
TF-IDF('bienvenue') = 1 * log(7/3)
TF-IDF('ahmed', id2), TF-IDF('bonsoir') and TF-IDF('souhaite') are left as an exercise.

N-grams to include context (N = 3); the trigrams of the corpus, with sample TF-IDF vectors for the first tweet:

[bonjour, ali, bienvenue] -> [log(7/2), log(7/3), log(7/3)]
[ali, bienvenue, leaders] -> [log(7/3), log(7/3), log(7/3)]
[bonsoir, ahmed, leaders], [ahmed, leaders, souhaite], [leaders, souhaite, bienvenue]
[bonsoir, ali, ahmed]
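Extracting those trigrams is a one-liner; a small sketch:

def ngrams(tokens, n=3):
    # sliding window of n consecutive tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

print(ngrams(["bonjour", "ali", "bienvenue", "leaders"]))
# [['bonjour', 'ali', 'bienvenue'], ['ali', 'bienvenue', 'leaders']]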


6. Word Embedding Techniques (Word2Vec)

From a term to its features vector, e.g. term = "machine":

1. Word identification in the vocabulary: the word is looked up in the dictionary (WordNet by default, of size N); if it is absent, we get an "out of vocabulary" error.
2. Bag of Words: the word becomes a one-hot vector of N entries, all 0 except a 1 at the word's index.
3. Neural network training: the one-hot vector is passed through the weight matrices W and V of a neural network trained on a prediction task, which yields the features vector.

(Figure: one-hot input vector for "machine" -> W -> V -> prediction.)
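A minimal sketch with gensim's Word2Vec, one common implementation (an assumption; the slides do not name a library), reusing the activity corpus:

from gensim.models import Word2Vec

sentences = [["bonjour", "ali", "bienvenue", "leaders"],
             ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"],
             ["bonsoir", "ali", "ahmed"]]
model = Word2Vec(sentences, vector_size=10, window=3, min_count=1)
print(model.wv["ali"])   # the learned features vector for "ali"
# Looking up a word absent from the vocabulary raises a KeyError,
# the "out of vocabulary" error mentioned above.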


6. Word Embedding Techniques (Word2Vec)

Some facts about the autoencoder:

• It represents the input in a low-dimensional space;
• It is an unsupervised learning algorithm (like PCA);
• It minimizes the same objective function as PCA;
• It is a neural network;
• The neural network's target output is its input.

With input vector X and output vector X', the encoder computes z = f(Wx) and the decoder computes y = g(Vz); training enforces X = X'.

Possible derivatives of the autoencoder: stacked autoencoder, sparse autoencoder.
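A minimal autoencoder sketch mirroring z = f(Wx), y = g(Vz) with X = X' as the training target; Keras is an assumption here, since the slides do not name a framework, and the data is a placeholder:

import numpy as np
from tensorflow import keras

dim_in, dim_z = 50, 10
inputs = keras.Input(shape=(dim_in,))
z = keras.layers.Dense(dim_z, activation="relu")(inputs)      # z = f(Wx)
outputs = keras.layers.Dense(dim_in, activation="linear")(z)  # y = g(Vz)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(100, dim_in)
autoencoder.fit(X, X, epochs=5, verbose=0)  # target output = input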


6. Word Embedding Techniques (Word2Vec)

Activity. Corpus:

user   Tweet                                                   Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

N = 4 is the size of the vocabulary and W is the size of the features vector. For the trigram [bonjour, ali, bienvenue] (N-gram with N = 3 to include context), each word is one-hot encoded and projected through the input weight matrix of shape (4, W):

bonjour   -> [0, 1, 0, 0] -> (v11, ..., v1W)
ali       -> [0, 0, 1, 0] -> (v21, ..., v2W)
bienvenue -> [0, 0, 0, 1] -> (v31, ..., v3W)

The final features vector is the average of the three projected vectors:
[(v11 + v21 + v31)/3, ..., (v1W + v2W + v3W)/3]
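A numpy sketch of this averaging; the weights here are random placeholders standing in for a trained matrix:

import numpy as np

W_size = 5
rng = np.random.default_rng(0)
weights = rng.random((4, W_size))   # input weight matrix of shape (4, W)

one_hot = np.eye(4)
bonjour, ali, bienvenue = one_hot[1], one_hot[2], one_hot[3]

# each one-hot vector selects one row of the weight matrix
projected = np.stack([v @ weights for v in (bonjour, ali, bienvenue)])
final_features = projected.mean(axis=0)   # (V1 + V2 + V3) / 3
print(final_features)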
7. Features Selection, Analysis and Transformation

• Transformation
• Linear Transformation: Principal Component Analysis (PCA)
• Non-Linear Transformation: Auto encoder
• Selection
• Heuristic Methods: Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization, etc.
• Statistical Methods: Correlation Matrix



7. Features Selection, Analysis and Transformation

Correlation Matrix
Given a dataset of N features and M samples, the correlation matrix is based on the Pearson moment:

M(feature I, feature J) = covariance(I, J) / (std(I) * std(J))

where std is the standard deviation (the square root of the variance). M is in [-1, 1]:

[-1; -0.5]   I and J are highly inversely correlated
]-0.5; 0]    I and J are not highly inversely correlated
]0; 0.5]     I and J are not highly correlated
]0.5; 1]     I and J are highly correlated

Example with N = 3 features:

              Feature I   Feature II   Feature III
Feature I     1           0.6          -0.2
Feature II    0.6         1            0.001
Feature III   -0.2        0.001        1

Features I and II are highly correlated (0.6), so we can drop one of them: N = 3 is reduced to N = 2, keeping (Features I & III) or (Features II & III).
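A short sketch with pandas (an assumed library choice; the data values are placeholders):

import pandas as pd

df = pd.DataFrame({"feature_I":   [1, 2, 3, 4],
                   "feature_II":  [2, 4, 5, 8],
                   "feature_III": [4, 1, 3, 2]})
print(df.corr(method="pearson"))  # Pearson correlation matrix
# drop one feature of any pair whose |correlation| exceeds, e.g., 0.5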
7. Features Selection, Analysis and Transformation

Principal Component Analysis
1. Compute the average vector of the dataset {Vi}: A = 1/N * Sum(Vi).
2. Adjust the dataset: for i = 1..N, Va_i = Vi - A.
3. Apply Singular Value Decomposition to the adjusted dataset {Va_i}: transform the dataset into an N*N matrix (N features) and compute its N eigenvectors (proper vectors) v_i.
4. Each vector of the old dataset can be described as a weighted sum of the eigenvectors, e.g. Vector1 = a1*v1 + a2*v2 + … + an*vn.
5. Sort the eigenvectors by their weight (e.g. V1: 3/8, V2: 8/8, V3: 2/8, V4: 7/8, …) and keep the top ones that account for, e.g., 85% of the variance.
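A minimal sketch with scikit-learn's PCA, one standard implementation; passing a float to n_components keeps enough components to explain that fraction of the variance:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)       # placeholder: 100 samples, 10 features
pca = PCA(n_components=0.85)      # keep 85% of the variance
X_reduced = pca.fit_transform(X)  # the mean adjustment (Vi - A) is done internally
print(X_reduced.shape, pca.explained_variance_ratio_)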
8. NLP Applications

• NLP Classification
  • Spam & Ham Detector
  • Fake News Detector
  • Sentiment Analysis
• NLP Topic Modeling
  • Word Cloud Visualisation
  • Clustering data/users -> communities
• Chatbot
  • Natural Language Processing (NLP): to process the natural language input by the human
  • Natural Language Generation (NLG): to generate a response to the human

