
Text Classification

Dr. Nguyen Van Vinh


CS Department – UET, Hanoi VNU
Is this spam?

Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Content
• Text classification problem
• Feature extraction from text
• Naïve Bayes method
• Logistic Regression method
• Sentiment Analysis Case Study
Text Classification: definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class c ∈ C
Classification Methods:
Hand-coded rules
• Choose features
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
• If rules carefully refined by expert
• But building and maintaining these rules is expensive
Classification Methods:
Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
• A training set of m hand-labeled documents (d1,c1), …, (dm,cm)
• Output:
• a learned classifier γ: d → c
Machine Learning Architecture
• Learning (training): training samples → features; features + training labels → learned model
• Inference (prediction): test sample → features; features + learned model → prediction
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
• Neural Network
• …
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

                 correct    not correct
  selected         tp           fp
  not selected     fn           tn
A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R)

• The harmonic mean is a very conservative average; see IIR § 8.3
• People usually use the balanced F1 measure
• i.e., with β = 1 (that is, α = ½): F = 2PR/(P + R)
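
As a quick check of these formulas, here is a small sketch (not from the slides) computing precision, recall, and balanced F1 from the tp/fp/fn counts above:

# Minimal sketch: precision, recall, and balanced F1 from raw counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)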
Features
• A measurable variable that is (rather, should be) distinctive of something we want to model.
• We usually choose features that are useful to identify something, i.e., to do classification
• Ex: “Cô gái đó rất đẹp trong bữa tiệc hôm đó.” (Vietnamese: “That girl was very beautiful at that party.”)
• We often need several features to adequately model something – but not too many!
Feature vectors
• Values for several features of an observation can be put into a single vector

  # proper nouns   # 1st person pronouns   # commas
        2                    0                 0
        5                    0                 0
        0                    1                 1
Feature vectors
• Features should be useful in discriminating between categories.

Classification: Sentiment Analysis
• Surface cues can basically tell you what’s going on here: presence or absence of certain words (great, awful)
• Steps to classification:
• Turn examples like this into feature vectors
• Pick a model / learning algorithm
• Train weights on data to get our classifier
Feature Representation
• Convert this example to a vector using bag-of-words features
• Very large vector space (size of vocabulary), sparse features
• Requires indexing the features (mapping them to axes)
• More sophisticated feature mappings possible (tf-idf), as well as lots of other features: character n-grams, parts of speech, lemmas, …
Accuracy as a function of data size (Sec. 15.3.1)
• With enough data, the choice of classifier may not matter
• [Figure: Brill and Banko on spelling correction]
Naive Bayes: Exercise!
• Model

• Inference

• Learning: maximize P(x,y) by reading counts off the data


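A minimal sketch of this counting view, assuming scikit-learn is available (illustrative, not the exercise's intended solution): CountVectorizer reads the word counts off the data, and MultinomialNB turns them into the prior P(y) and the smoothed likelihoods P(word|y).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["this is a good cat", "this is a bad day"]
train_labels = [1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # bag-of-words counts

model = MultinomialNB()  # add-one (Laplace) smoothing by default
model.fit(X_train, train_labels)

print(model.predict(vectorizer.transform(["a good cat"])))  # [1]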
LOGISTIC REGRESSION MODEL
Logistic Regression
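
The model details on the slide above are images; as a standard sketch (an assumed formulation, not the slide's own notation), logistic regression scores a feature vector linearly and squashes the score into a probability with the sigmoid:

import numpy as np

def predict_proba(w, b, x):
    # P(y = 1 | x) = sigmoid(w . x + b)
    s = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(y, p):
    # per-example loss that training minimizes
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(predict_proba(np.array([1.0, -1.0]), 0.0, np.array([2.0, 0.5])))  # ~0.82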
Sigmoid Function
• Among the functions with the two properties above, the sigmoid f(s) = 1/(1 + e^(−s)) is the most widely used.
• f′(s) = f(s)(1 − f(s))
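A tiny sketch (not from the slides) checking the derivative identity f′(s) = f(s)(1 − f(s)) numerically:

import math

def f(s):
    return 1.0 / (1.0 + math.exp(-s))

s, h = 0.7, 1e-6
numeric = (f(s + h) - f(s - h)) / (2 * h)  # central-difference estimate
analytic = f(s) * (1 - f(s))
print(abs(numeric - analytic) < 1e-9)  # True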
Optimization of loss based on gradient descent
• Gradient descent is an optimization algorithm to find the minimum of a function. We start with a random point on the function and move in the negative direction of the gradient of the function to reach the local/global minimum.
• In more detail: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Gradient descent
• If the derivative at xt satisfies f′(xt) > 0, then xt lies to the right of the minimum x* (and vice versa). To bring the next point xt+1 closer to x*, we need to move xt to the left, i.e., in the negative direction. In other words, we move against the sign of the derivative: xt+1 = xt + ∆
• xt+1 = xt − µ f′(xt), where µ is the learning rate
Gradient Descent Algorithm

Variants of GD algorithm:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
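
A minimal sketch of the basic (batch) update rule xt+1 = xt − µ f′(xt) on a simple one-dimensional function (illustrative, not from the slides):

# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x* = 3.
def grad(x):
    return 2 * (x - 3)  # f'(x)

x = 0.0    # starting point
mu = 0.1   # learning rate
for _ in range(100):
    x = x - mu * grad(x)  # move against the gradient

print(round(x, 4))  # ~3.0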

Pros and cons of gradient descent
• Simple and often quite effective on ML tasks
• Often very scalable
• Only applies to smooth functions (differentiable)
• Might find a local minimum, rather than a global one
Visualization of GD algorithms
Selecting the learning rate
• Use grid search in log-space over small values on a tuning set:
• e.g., 0.01, 0.001, …
• Sometimes, decrease it after each pass:
• e.g., by a factor of 1/(1 + d·t), where t is the epoch
• sometimes 1/t²
• Fancier techniques:
• Adaptive gradients: scale the gradient differently for each dimension (Adagrad, ADAM, …)
Logistic Regression: Summary

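The summary slide is an image; as a sketch tying the pieces together (a standard formulation assumed here, not the slide's own code), logistic regression trained with batch gradient descent:

import numpy as np

# Toy data: rows are feature vectors (e.g., word counts), y is 0/1 sentiment.
X = np.array([[2.0, 0.0], [1.0, 3.0], [0.0, 2.0], [3.0, 1.0]])
y = np.array([1, 0, 0, 1])

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

w, b, mu = np.zeros(X.shape[1]), 0.0, 0.1  # weights, bias, learning rate

for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)   # gradient of average cross-entropy
    grad_b = np.mean(p - y)
    w -= mu * grad_w                  # gradient descent updates
    b -= mu * grad_b

print((sigmoid(X @ w + b) > 0.5).astype(int))  # matches y: [1 0 0 1]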
SENTIMENT ANALYSIS

Google Product Search

Bing Shopping
Sentiment Analysis task
• Input: a sentence (or passage) of text.
• Output: decide whether the sentence is positive (1) or negative (0).
• Performed on a movie review dataset: IMDB
Sentiment Analysis
• Prepare the data: IMDB Movie Reviews
• Preprocess the data
• Extract features (vectorization)
• Choose and train a machine learning model
Data preprocessing
• The smallest unit is the word; a text is a sequence of words.
Example: Text: This is a cat. --> Word Sequence: [this, is, a, cat]
• Data crawled from the web is often very “dirty” (HTML markup, abbreviations, …), so it must be preprocessed
• Use regular expressions for this, as in the sketch below
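A minimal regex-based cleaning sketch (the name clean_text matches the vectorizer example later in these slides, but the exact rules here are assumptions, not the original function):

import re

def clean_text(text):
    # Hypothetical cleaning rules: strip HTML tags, keep letters, lowercase.
    text = re.sub(r"<[^>]+>", " ", text)     # remove HTML markup
    text = re.sub(r"[^a-zA-Z']", " ", text)  # drop digits and punctuation
    text = re.sub(r"\s+", " ", text)         # collapse whitespace
    return text.strip().lower()

print(clean_text("<br />This movie is GREAT!!!"))  # "this movie is great"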
Regular expressions
• A regular expression (or regex) is a sequence of characters that represents a search pattern
• Each character has a meaning; for example, “.” means any character that isn't the newline character '\n'
• These characters are often combined with quantifiers, such as *, which means zero or more
• Regular expressions are very useful for string processing, e.g.:
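
A small demo of “.” combined with the quantifier * (here its non-greedy variant *?):

import re

# "c.*?t" matches a "c", then any characters, up to the nearest "t".
print(re.findall(r"c.*?t", "a cat and a carrot"))  # ['cat', 'carrot']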

Regular expressions (reference table on the original slide)
Quiz
• Example:
1) Extract the numbers from s:
s = 'My 2 favourite numbers are 8 and 25. My mobile is 0912203062'
2) Extract the email addresses from s:
s = 'Hello from [email protected] to [email protected] about the meeting @2PM'
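One possible solution sketch with re.findall (several patterns would work; the addresses on the slide are redacted, so a hypothetical string is used for the second part):

import re

s1 = "My 2 favourite numbers are 8 and 25. My mobile is 0912203062"
print(re.findall(r"\d+", s1))  # ['2', '8', '25', '0912203062']

# Hypothetical example string, since the slide's addresses are redacted:
s2 = "Hello from alice@example.com to bob@example.org about the meeting"
print(re.findall(r"[\w.]+@[\w.]+", s2))  # ['alice@example.com', 'bob@example.org']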
Features extraction
• Now that we have a way to extract information from text in the form of word sequences, we need a way to transform these word sequences into numerical features: this is vectorization.
• The simplest vectorization technique is Bag of Words (BOW). It starts with a list of words called the vocabulary (this is often all the words that occur in the training data)
Features extraction
• To use BOW vectorization in Python, we can rely on CountVectorizer from the scikit-learn library
• scikit-learn has a built-in list of stop words that can be ignored by passing stop_words="english" to the vectorizer
• Moreover, we can pass our custom pre-processing function from earlier to automatically clean the text before it’s vectorized

Training texts: ["This is a good cat", "This is a bad day"] =>
vocabulary: [this, cat, day, is, good, a, bad]
New text: "This day is a good day" --> [1, 0, 2, 1, 1, 1, 0]
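A hand-rolled sketch of this mapping (illustrative; scikit-learn's CountVectorizer behaves the same way, except that it sorts its vocabulary and by default drops one-character tokens such as "a", so the indices below differ from the slide's ordering while the counts are the same):

train_texts = ["This is a good cat", "This is a bad day"]

# Build the vocabulary: every distinct word seen in the training data.
vocabulary = []
for text in train_texts:
    for word in text.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

def vectorize(text):
    # Count how often each vocabulary word occurs in the new text.
    words = text.lower().split()
    return [words.count(v) for v in vocabulary]

print(vocabulary)                           # ['this', 'is', 'a', 'good', 'cat', 'bad', 'day']
print(vectorize("This day is a good day"))  # [1, 1, 1, 1, 0, 0, 2]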
Sentiment Analysis Case Study
• IMDB text data: the IMDB movie reviews dataset is a set of 50,000 reviews, half of which are positive and the other half negative
• We can download the data from:
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
• The extracted data directory is: aclImdb
• We can use a loading function like the sketch below to load the training/test datasets from IMDB
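The loading function on the original slide is an image; a minimal sketch, assuming the archive keeps its standard layout (train/pos, train/neg, test/pos, test/neg, one review per .txt file):

import os

def load_imdb(split_dir):
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(split_dir, label_name)
        for filename in os.listdir(folder):
            with open(os.path.join(folder, filename), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_imdb("aclImdb/train")
test_texts, test_labels = load_imdb("aclImdb/test")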
Sentiment Analysis Case Study
• Feature vectors that result from BOW are usually very large (80,000-dimensional vectors in this case)
• We need to use simple algorithms that are efficient on a large number of features (e.g., Naive Bayes, linear SVM, or logistic regression), as in the baseline sketch below
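A baseline sketch putting the pieces together (assumed setup, reusing the hypothetical clean_text and the load_imdb data from the sketches above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(stop_words="english", preprocessor=clean_text)
X_train = vectorizer.fit_transform(train_texts)  # large, sparse BOW matrix
X_test = vectorizer.transform(test_texts)

model = LogisticRegression(max_iter=1000)  # efficient on sparse, wide data
model.fit(X_train, train_labels)
print(accuracy_score(test_labels, model.predict(X_test)))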
Improving the current model
• Feature extraction is very important (feature engineering)
• There are some biases attached to only looking at how many times a word occurs in a text. In particular, the longer the text, the higher its feature values (word counts) will be
• Use TF-IDF features
• Use n-grams
Improving the model: TF-IDF
• TF-IDF features
• We can train a Linear SVM model with TF-IDF features simply by replacing CountVectorizer with TfidfVectorizer, as sketched below
• The result improves by about 2%
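A sketch of that swap (same assumed setup as the baseline above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(stop_words="english", preprocessor=clean_text)
X_train = vectorizer.fit_transform(train_texts)

model = LinearSVC()  # linear SVM, efficient for sparse high-dimensional data
model.fit(X_train, train_labels)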
Improving the model: n-grams
• Using words independently is not enough. For example: if the word good occurs in a text, we will naturally tend to say that this text is positive, even if the actual expression that occurs is actually not good. Phrases are better cues.
• Use n-grams to handle this
• An N-gram is a sequence of N successive words (e.g., very good [2-gram] and not good at all [4-gram]). Using N-grams, we produce richer word sequences.
• Example with N = 2:
This is a cat. --> [this, is, a, cat, (this, is), (is, a), (a, cat)]
Improving the model: n-grams
• In practice, including N-grams in our TF-IDF vectorizer is as simple as providing an additional parameter ngram_range=(1, N).

vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2))
Homework
• Sentiment analysis on the standard IMDB dataset (link: http://ai.stanford.edu/~amaas/data/sentiment/)
References
• Chapters 4 and 5, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin
https://web.stanford.edu/~jurafsky/slp3/4.pdf
https://web.stanford.edu/~jurafsky/slp3/5.pdf
