
Artificial Intelligence for Natural Language Processing (NLP)

Part II – From Word to Numerical Analysis


Dr. Eng. Wael Ouarda
Assistant Professor, CRNS, Higher Education Ministry, Tunisia

Centre de Recherche en Numérique de Sfax, Route de Tunis km 10, Sakiet Ezzit, 3021 Sfax, Tunisia



1. Machine Learning algorithm for NLP

Example setup: 100 persons, 7 emotions. The pipeline is:

1. Data Scraping
2. Data Cleaning
3. Data Representation (Word Embedding)
4. Data Partitioning: the 100 persons are split into 85 for Train & Validation and 15 for Test; the 85 are then split again into 85 * 0.8 for Train and 85 * 0.2 for Validation, giving (X_train, Y_train), (X_Val, Y_Val) and (X_Test, Y_Test).
5. Machine Learning (Algorithm, Options): the algorithm is trained on (X_train, Y_train) to produce a Model.
6. Performance Evaluation: Y_Val' = Model.predict(X_Val) is compared against Y_Val, then Y_Test' = Model.predict(X_Test) against Y_Test.
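A minimal sketch of this split-train-evaluate loop, assuming scikit-learn; the feature matrix, labels and model choice below are placeholders, not the course's:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 50)            # placeholder features: 100 persons
y = np.random.randint(0, 7, size=100)  # placeholder labels: 7 emotions

# 85 persons for Train & Validation, 15 for Test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=15, random_state=0)
# 85 * 0.8 for Train, 85 * 0.2 for Validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_val_pred = model.predict(X_val)      # Y_Val' = Model.predict(X_Val)
y_test_pred = model.predict(X_test)    # Y_Test' = Model.predict(X_Test)
print(accuracy_score(y_val, y_val_pred), accuracy_score(y_test, y_test_pred))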


2. Web Scraping Tools

• Open source Python libraries and frameworks for web scraping:
  • Textual content:
    • Newspaper3k: sends an HTTP request to the website's server to retrieve the data displayed on the target web page;
    • BeautifulSoup: a Python library designed to parse data, i.e., to extract data from HTML or XML documents;
    • Selenium: a web driver designed to render web pages as your web browser would, for the purpose of automated testing of web applications;
    • Scrapy: a complete web scraping framework designed explicitly for the job of scraping the web.
  • Visual content:
    • MechanicalSoup: a Python library designed to parse data, i.e., to extract URLs and hypertext from web pages.

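As a short illustration of the BeautifulSoup approach (a sketch only; the URL is a placeholder, not from the course):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text   # fetch the target page
soup = BeautifulSoup(html, "html.parser")         # parse the returned HTML
print(soup.title.get_text())                      # page title
for link in soup.find_all("a"):                   # every hyperlink on the page
    print(link.get("href"))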


3. Libraries & Frameworks

• Newspaper3k: scraping data;
• Facebook Scraper;
• Pandas: file I/O;
• Seaborn: statistics;
• NumPy: array use;
• NLTK: Natural Language Toolkit (dictionary (graph = WordNet), stopwords, punctuation, etc.);
• re: regular expressions.



4. Cleaning process

1. Tokenization: split the document into a list of words.
2. Lower casing: transform upper case into lower case.
3. Stop words removal: stop words are a list of words = ["When", "I", "How", ...]; the list can be modified by removing some words or adding other ones.
4. Special character removal: @ # ' " etc.
5. Punctuation removal: : , ; - ? ! etc.
6. Stemming: keep the base of the word: player, players, played, plays -> play.
7. Lemmatization: have and had will be considered as have; plays and played will be considered as play.
8. Spell check.
9. Translation.

A sketch of these steps with NLTK follows.
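This is a minimal sketch of steps 1-7, assuming NLTK with its punkt, stopwords and wordnet resources already downloaded; the input sentence is a placeholder:

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Hi? How are you, I am very happy to see you today!"
tokens = word_tokenize(text)                        # 1. tokenization
tokens = [t.lower() for t in tokens]                # 2. lower casing
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]       # 3. stop words removal
tokens = [t for t in tokens if t not in string.punctuation]   # 4-5. special characters and punctuation
stems = [PorterStemmer().stem(t) for t in tokens]             # 6. stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]   # 7. lemmatization
print(lemmas)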
4. Cleaning process: Regular Expression (re)

Examples: @ali, @ahmed, #, 'e', 'A12', 'A13', … cannot be removed using NLTK functions.
re processes the text shared on the web or on social media as a string.

• \d: Matches any decimal digit; this is equivalent to the class [0-9].
• \D: Matches any non-digit character; this is equivalent to the class [^0-9].
• \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
• \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
• \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
• \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
• Example: re.sub(r'[^@]', ' ', text) replaces every character that is not @ with a space, leaving only the @ symbols: '@ @ @ @'.
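A short sketch of this idea, stripping the mentions and codes from the examples above; the exact patterns are illustrative choices, not prescribed by the slides:

import re

tweet = "@ali and @ahmed posted #NLP results A12 A13"
tweet = re.sub(r"@\w+", "", tweet)          # drop @-mentions
tweet = re.sub(r"#\w+", "", tweet)          # drop hashtags
tweet = re.sub(r"\b[A-Z]\d+\b", "", tweet)  # drop codes such as A12, A13
print(re.sub(r"\s+", " ", tweet).strip())   # normalize leftover whitespace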
4. Cleaning process: Regular Expression (re)

Pattern   Description
^         Matches the beginning of a line (^ab means the string starts with ab).
$         Matches the end of a line (a$ means the string ends with a).
.         Matches any single character except newline; with the re.DOTALL flag it matches newline as well.
[...]     Matches any single character in the brackets.
[^...]    Matches any single character not in the brackets.


4. Cleaning process: worked example

Input: Hi? How are you, I am very content to see you today :)!

Tokenization:
[Hi, ?, How, are, you, ,, I, am, very, content, to, see, you, today, :, ), !]

Punctuation removal:
[Hi, How, are, you, I, am, very, content, to, see, you, today, )]

Special character removal:
[Hi, How, are, you, I, am, very, content, to, see, you, today]

Lower casing:
[hi, how, are, you, i, am, very, content, to, see, you, today]

Translation & spell check ("content" is French for "happy"):
[hi, how, are, you, i, am, very, happy, to, see, you, today]

Stop words removal:
[very, happy, see, today]

Lemmatization:
[very, happiness, see, today]
5. Sample of NLP Libraries for sentiment analysis

Sentiment is a tuple of (Polarity, Subjectivity):

• Polarity in [-1 (negative), 1 (positive)]: the orientation of the opinion behind the text;
• Subjectivity in [0, 1]: the weight of subjectivity of the text.

Pipeline: Data Collection -> Data Cleaning -> Data Representation -> Data Classification
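A minimal sketch with TextBlob, one library whose .sentiment returns exactly this (polarity, subjectivity) tuple; the slides do not name a specific library, so this choice is an assumption:

from textblob import TextBlob

blob = TextBlob("I am very happy to see you today")
print(blob.sentiment.polarity)      # orientation of the opinion, in [-1, 1]
print(blob.sentiment.subjectivity)  # weight of subjectivity, in [0, 1]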


6. Word Embedding Techniques (TF-IDF)

TF-IDF: Term Frequency – Inverse Document Frequency

Terminology
• t — term (word)
• d — document (set of words)
• N — number of documents in the corpus
• Corpus — the total document set

TF(t, d) = count of t in d / number of words in d
DF(t) = number of documents containing t
IDF(t) = log(N / DF(t))
TF-IDF(t, d) = TF(t, d) * log(N / (DF(t) + 1))   (the +1 smooths the denominator)

Example:

user   Tweet                                                 Label
Id1    Tweet11 = ["word111", "word112"] -> TF = [0.5, 0.5]   +
Id1    Tweet12                                               +
Id2    Tweet21                                               -
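The formulas translate directly into plain Python; this sketch assumes the natural logarithm, since the slides do not fix the log base, and uses placeholder documents:

import math

docs = [["word111", "word112"],
        ["word111", "word211", "word212"]]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)         # TF(t, d)

def df(term):
    return sum(1 for d in docs if term in d)  # DF(t)

def tf_idf(term, doc):
    return tf(term, doc) * math.log(N / (df(term) + 1))

print(tf_idf("word112", docs[0]))  # 0.5 * log(2/2) = 0.0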


6. Word Embedding Techniques (TF-IDF)

Activity. Recall:
TF(t, d) = count of t in d / number of words in d
DF(t) = number of documents containing t
TF-IDF(t, d) = TF(t, d) * log(N / (DF(t) + 1))

user   Tweet                                                   Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

Sample computations (with N = 7):
TF-IDF('bonjour', id1) = tf('bonjour', id1) * log(7/2) = 1 * log(7/2)
TF-IDF('ali', id1) = tf('ali', id1) * log(7/df('ali')) = 1 * log(7/3)
TF-IDF('ali', id2) = tf('ali', id2) * log(7/df('ali')) = 1 * log(7/3)
TF-IDF('ahmed', id1) = 2 * log(7/4)
TF-IDF('leaders') = 1 * log(7/3)
TF-IDF('bienvenue') = 1 * log(7/3)
TF-IDF('ahmed', id2), TF-IDF('bonsoir') and TF-IDF('souhaite') are left as an exercise.

N-grams to include context (N = 3); the trigrams of the corpus, with sample TF-IDF vectors for the first tweet:

[bonjour, ali, bienvenue] -> [log(7/2), log(7/3), log(7/3)]
[ali, bienvenue, leaders] -> [log(7/3), log(7/3), log(7/3)]
[bonsoir, ahmed, leaders], [ahmed, leaders, souhaite], [leaders, souhaite, bienvenue]
[bonsoir, ali, ahmed]
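Extracting those trigrams is a one-liner; a small sketch:

def ngrams(tokens, n=3):
    # sliding window of n consecutive tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

print(ngrams(["bonjour", "ali", "bienvenue", "leaders"]))
# [['bonjour', 'ali', 'bienvenue'], ['ali', 'bienvenue', 'leaders']]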


6. Word Embedding Techniques (Word2Vec)

From a term to its features vector, e.g. term = "machine":

1. Word identification in the vocabulary: the word is looked up in the dictionary (WordNet by default, of size N); if it is absent, we get an "out of vocabulary" error.
2. Bag of Words: the word becomes a one-hot vector of N entries, all 0 except a 1 at the word's index.
3. Neural network training: the one-hot vector is passed through the weight matrices W and V of a neural network trained on a prediction task, which yields the features vector.

(Figure: one-hot input vector for "machine" -> W -> V -> prediction.)
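A minimal sketch with gensim's Word2Vec, one common implementation (an assumption; the slides do not name a library), reusing the activity corpus:

from gensim.models import Word2Vec

sentences = [["bonjour", "ali", "bienvenue", "leaders"],
             ["bonsoir", "ahmed", "leaders", "souhaite", "bienvenue", "ahmed"],
             ["bonsoir", "ali", "ahmed"]]
model = Word2Vec(sentences, vector_size=10, window=3, min_count=1)
print(model.wv["ali"])   # the learned features vector for "ali"
# Looking up a word absent from the vocabulary raises a KeyError,
# the "out of vocabulary" error mentioned above.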


6. Word Embedding Techniques (Word2Vec)

Some facts about the autoencoder:

• It represents the input in a low-dimensional space;
• It is an unsupervised learning algorithm (like PCA);
• It minimizes the same objective function as PCA;
• It is a neural network;
• The neural network's target output is its input.

With input vector X and output vector X', the encoder computes z = f(Wx) and the decoder computes y = g(Vz); training enforces X = X'.

Possible derivatives of the autoencoder: stacked autoencoder, sparse autoencoder.
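A minimal autoencoder sketch mirroring z = f(Wx), y = g(Vz) with X = X' as the training target; Keras is an assumption here, since the slides do not name a framework, and the data is a placeholder:

import numpy as np
from tensorflow import keras

dim_in, dim_z = 50, 10
inputs = keras.Input(shape=(dim_in,))
z = keras.layers.Dense(dim_z, activation="relu")(inputs)      # z = f(Wx)
outputs = keras.layers.Dense(dim_in, activation="linear")(z)  # y = g(Vz)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(100, dim_in)
autoencoder.fit(X, X, epochs=5, verbose=0)  # target output = input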


6. Word Embedding Techniques (Word2Vec)

Activity. Corpus:

user   Tweet                                                   Label
Id1    [bonjour, ali, bienvenue, leaders]                      +
Id1    [bonsoir, ahmed, leaders, souhaite, bienvenue, ahmed]   +
Id2    [bonsoir, ali, ahmed]                                   -

N = 4 is the size of the vocabulary and W is the size of the features vector. For the trigram [bonjour, ali, bienvenue] (N-gram with N = 3 to include context), each word is one-hot encoded and projected through the input weight matrix of shape (4, W):

bonjour   -> [0, 1, 0, 0] -> (v11, ..., v1W)
ali       -> [0, 0, 1, 0] -> (v21, ..., v2W)
bienvenue -> [0, 0, 0, 1] -> (v31, ..., v3W)

The final features vector is the average of the three projected vectors:
[(v11 + v21 + v31)/3, ..., (v1W + v2W + v3W)/3]
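A numpy sketch of this averaging; the weights here are random placeholders standing in for a trained matrix:

import numpy as np

W_size = 5
rng = np.random.default_rng(0)
weights = rng.random((4, W_size))   # input weight matrix of shape (4, W)

one_hot = np.eye(4)
bonjour, ali, bienvenue = one_hot[1], one_hot[2], one_hot[3]

# each one-hot vector selects one row of the weight matrix
projected = np.stack([v @ weights for v in (bonjour, ali, bienvenue)])
final_features = projected.mean(axis=0)   # (V1 + V2 + V3) / 3
print(final_features)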
7. Features Selection, Analysis and Transformation

• Transformation
• Linear Transformation: Principal Component Analysis (PCA)
• Non-Linear Transformation: Auto encoder
• Selection
• Heuristic Methods: Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization, etc.
• Statistical Methods: Correlation Matrix



7. Features Selection, Analysis and Transformation

Correlation Matrix
Given a dataset of N features and M samples, the correlation matrix is based on the Pearson moment:

M(feature I, feature J) = covariance(I, J) / (std(I) * std(J))

where std is the standard deviation (the square root of the variance). M is in [-1, 1]:

[-1; -0.5]   I and J are highly inversely correlated
]-0.5; 0]    I and J are not highly inversely correlated
]0; 0.5]     I and J are not highly correlated
]0.5; 1]     I and J are highly correlated

Example with N = 3 features:

              Feature I   Feature II   Feature III
Feature I     1           0.6          -0.2
Feature II    0.6         1            0.001
Feature III   -0.2        0.001        1

Features I and II are highly correlated (0.6), so we can drop one of them: N = 3 is reduced to N = 2, keeping (Features I & III) or (Features II & III).
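A short sketch with pandas (an assumed library choice; the data values are placeholders):

import pandas as pd

df = pd.DataFrame({"feature_I":   [1, 2, 3, 4],
                   "feature_II":  [2, 4, 5, 8],
                   "feature_III": [4, 1, 3, 2]})
print(df.corr(method="pearson"))  # Pearson correlation matrix
# drop one feature of any pair whose |correlation| exceeds, e.g., 0.5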
7. Features Selection, Analysis and Transformation

Principal Component Analysis
1. Compute the average vector of the dataset {Vi}: A = 1/N * Sum(Vi).
2. Adjust the dataset: for i = 1..N, Va_i = Vi - A.
3. Apply Singular Value Decomposition to the adjusted dataset {Va_i}: transform the dataset into an N*N matrix (N features) and compute its N eigenvectors (proper vectors) v_i.
4. Each vector of the old dataset can be described as a weighted sum of the eigenvectors, e.g. Vector1 = a1*v1 + a2*v2 + … + an*vn.
5. Sort the eigenvectors by their weight (e.g. V1: 3/8, V2: 8/8, V3: 2/8, V4: 7/8, …) and keep the top ones that account for, e.g., 85% of the variance.
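A minimal sketch with scikit-learn's PCA, one standard implementation; passing a float to n_components keeps enough components to explain that fraction of the variance:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)       # placeholder: 100 samples, 10 features
pca = PCA(n_components=0.85)      # keep 85% of the variance
X_reduced = pca.fit_transform(X)  # the mean adjustment (Vi - A) is done internally
print(X_reduced.shape, pca.explained_variance_ratio_)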
8. NLP Applications

• NLP Classification
  • Spam & Ham Detector
  • Fake News Detector
  • Sentiment Analysis
• NLP Topic Modeling
  • Word Cloud Visualisation
  • Clustering data/users -> communities
• Chatbot
  • Natural Language Processing (NLP): to process the natural language input by the human
  • Natural Language Generation (NLG): to generate a response to the human

