
Text Classification

Dr. Nguyen Van Vinh


CS Department – UET, Hanoi VNU
Is this spam?

Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Content
• Text classification problem
• Feature extraction from text
• Naïve Bayes method
• Logistic Regression method
• Sentiment Analysis Case Study
Text Classification: definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class c ∈ C
Classification Methods:
Hand-coded rules
• Choose features
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
• If rules carefully refined by expert
• But building and maintaining these rules is expensive
Classification Methods:
Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
• A training set of m hand-labeled documents (d1,c1), …, (dm,cm)
• Output:
• a learned classifier γ: d → c
Machine Learning Architecture
• Learning (training): training samples → features; features + training labels → learned model
• Inference (prediction): test sample → features; features + learned model → prediction
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
• Neural Network
• …
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

                 correct    not correct
  selected         tp           fp
  not selected     fn           tn
A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R)

• The harmonic mean is a very conservative average; see IIR § 8.3
• People usually use the balanced F1 measure
• i.e., with β = 1 (that is, α = ½): F = 2PR/(P + R)
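
As a quick check of these formulas, here is a small sketch (not from the slides) computing precision, recall, and balanced F1 from the tp/fp/fn counts above:

# Minimal sketch: precision, recall, and balanced F1 from raw counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives.
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)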
Features
• A measurable variable that is (rather, should be) distinctive of something we want to model.
• We usually choose features that are useful to identify something, i.e., to do classification
• Ex: “Cô gái đó rất đẹp trong bữa tiệc hôm đó.” (Vietnamese: “That girl was very beautiful at that party.”)
• We often need several features to adequately model something – but not too many!
Feature vectors
• Values for several features of an observation can be put into a single vector

  # proper nouns   # 1st person pronouns   # commas
        2                    0                 0
        5                    0                 0
        0                    1                 1
Feature vectors
• Features should be useful in discriminating between categories.

Classification: Sentiment Analysis
• Surface cues can basically tell you what’s going on here: presence or absence of certain words (great, awful)
• Steps to classification:
• Turn examples like this into feature vectors
• Pick a model / learning algorithm
• Train weights on data to get our classifier
Feature Representation
• Convert this example to a vector using bag-of-words features
• Very large vector space (size of vocabulary), sparse features
• Requires indexing the features (mapping them to axes)
• More sophisticated feature mappings possible (tf-idf), as well as lots of other features: character n-grams, parts of speech, lemmas, …
Accuracy as a function of data size (Sec. 15.3.1)
• With enough data, the choice of classifier may not matter
• [Figure: Brill and Banko on spelling correction]
Naive Bayes: Exercise!
• Model

• Inference

• Learning: maximize P(x,y) by reading counts off the data


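A minimal sketch of this counting view, assuming scikit-learn is available (illustrative, not the exercise's intended solution): CountVectorizer reads the word counts off the data, and MultinomialNB turns them into the prior P(y) and the smoothed likelihoods P(word|y).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["this is a good cat", "this is a bad day"]
train_labels = [1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # bag-of-words counts

model = MultinomialNB()  # add-one (Laplace) smoothing by default
model.fit(X_train, train_labels)

print(model.predict(vectorizer.transform(["a good cat"])))  # [1]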
LOGISTIC REGRESSION MODEL
Logistic Regression
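
The model details on the slide above are images; as a standard sketch (an assumed formulation, not the slide's own notation), logistic regression scores a feature vector linearly and squashes the score into a probability with the sigmoid:

import numpy as np

def predict_proba(w, b, x):
    # P(y = 1 | x) = sigmoid(w . x + b)
    s = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(y, p):
    # per-example loss that training minimizes
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(predict_proba(np.array([1.0, -1.0]), 0.0, np.array([2.0, 0.5])))  # ~0.82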
Sigmoid Function
• Among the functions with the two properties above, the sigmoid f(s) = 1/(1 + e^(−s)) is the most widely used.
• f′(s) = f(s)(1 − f(s))
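A tiny sketch (not from the slides) checking the derivative identity f′(s) = f(s)(1 − f(s)) numerically:

import math

def f(s):
    return 1.0 / (1.0 + math.exp(-s))

s, h = 0.7, 1e-6
numeric = (f(s + h) - f(s - h)) / (2 * h)  # central-difference estimate
analytic = f(s) * (1 - f(s))
print(abs(numeric - analytic) < 1e-9)  # True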
Optimization of loss based on gradient descent
• Gradient descent is an optimization algorithm to find the minimum of a function. We start with a random point on the function and move in the negative direction of the gradient of the function to reach the local/global minimum.
• In more detail: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Gradient descent
• If the derivative at xt satisfies f′(xt) > 0, then xt lies to the right of the minimum x* (and vice versa). To bring the next point xt+1 closer to x*, we need to move xt to the left, i.e., in the negative direction. In other words, we move against the sign of the derivative: xt+1 = xt + ∆
• xt+1 = xt − µ f′(xt), where µ is the learning rate
Gradient Descent Algorithm

Variants of GD algorithm:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
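
A minimal sketch of the basic (batch) update rule xt+1 = xt − µ f′(xt) on a simple one-dimensional function (illustrative, not from the slides):

# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x* = 3.
def grad(x):
    return 2 * (x - 3)  # f'(x)

x = 0.0    # starting point
mu = 0.1   # learning rate
for _ in range(100):
    x = x - mu * grad(x)  # move against the gradient

print(round(x, 4))  # ~3.0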

Pros and cons of gradient descent
• Simple and often quite effective on ML tasks
• Often very scalable
• Only applies to smooth functions (differentiable)
• Might find a local minimum, rather than a global one
Visualization of GD algorithms
Selecting the learning rate
• Use grid search in log-space over small values on a tuning set:
• e.g., 0.01, 0.001, …
• Sometimes, decrease it after each pass:
• e.g., by a factor of 1/(1 + d·t), where t is the epoch
• sometimes 1/t²
• Fancier techniques:
• Adaptive gradients: scale the gradient differently for each dimension (Adagrad, ADAM, …)
Logistic Regression: Summary

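The summary slide is an image; as a sketch tying the pieces together (a standard formulation assumed here, not the slide's own code), logistic regression trained with batch gradient descent:

import numpy as np

# Toy data: rows are feature vectors (e.g., word counts), y is 0/1 sentiment.
X = np.array([[2.0, 0.0], [1.0, 3.0], [0.0, 2.0], [3.0, 1.0]])
y = np.array([1, 0, 0, 1])

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

w, b, mu = np.zeros(X.shape[1]), 0.0, 0.1  # weights, bias, learning rate

for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)   # gradient of average cross-entropy
    grad_b = np.mean(p - y)
    w -= mu * grad_w                  # gradient descent updates
    b -= mu * grad_b

print((sigmoid(X @ w + b) > 0.5).astype(int))  # matches y: [1 0 0 1]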
SENTIMENT ANALYSIS

Google Product Search

Bing Shopping
Sentiment Analysis task
• Input: a sentence (or passage) of text.
• Output: decide whether the sentence is positive (1) or negative (0).
• Performed on a movie review dataset: IMDB
Sentiment Analysis
• Prepare the data: IMDB Movie Reviews
• Preprocess the data
• Extract features (vectorization)
• Choose and train a machine learning model
Data preprocessing
• The smallest unit is the word; a text is a sequence of words.
Example: Text: This is a cat. --> Word Sequence: [this, is, a, cat]
• Data crawled from the web is often very “dirty” (HTML markup, abbreviations, …), so it must be preprocessed
• Use regular expressions for this, as in the sketch below
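A minimal regex-based cleaning sketch (the name clean_text matches the vectorizer example later in these slides, but the exact rules here are assumptions, not the original function):

import re

def clean_text(text):
    # Hypothetical cleaning rules: strip HTML tags, keep letters, lowercase.
    text = re.sub(r"<[^>]+>", " ", text)     # remove HTML markup
    text = re.sub(r"[^a-zA-Z']", " ", text)  # drop digits and punctuation
    text = re.sub(r"\s+", " ", text)         # collapse whitespace
    return text.strip().lower()

print(clean_text("<br />This movie is GREAT!!!"))  # "this movie is great"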
Regular expressions
• A regular expression (or regex) is a sequence of characters that represents a search pattern
• Each character has a meaning; for example, “.” means any character that isn't the newline character '\n'
• These characters are often combined with quantifiers, such as *, which means zero or more
• Regular expressions are very useful for string processing, e.g.:
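
A small demo of “.” combined with the quantifier * (here its non-greedy variant *?):

import re

# "c.*?t" matches a "c", then any characters, up to the nearest "t".
print(re.findall(r"c.*?t", "a cat and a carrot"))  # ['cat', 'carrot']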

Regular expressions (reference table on the original slide)
Quiz
• Example:
1) Extract the numbers from s:
s = 'My 2 favourite numbers are 8 and 25. My mobile is 0912203062'
2) Extract the email addresses from s:
s = 'Hello from [email protected] to [email protected] about the meeting @2PM'
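One possible solution sketch with re.findall (several patterns would work; the addresses on the slide are redacted, so a hypothetical string is used for the second part):

import re

s1 = "My 2 favourite numbers are 8 and 25. My mobile is 0912203062"
print(re.findall(r"\d+", s1))  # ['2', '8', '25', '0912203062']

# Hypothetical example string, since the slide's addresses are redacted:
s2 = "Hello from alice@example.com to bob@example.org about the meeting"
print(re.findall(r"[\w.]+@[\w.]+", s2))  # ['alice@example.com', 'bob@example.org']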
Features extraction
• Now that we have a way to extract information from text in the form of word sequences, we need a way to transform these word sequences into numerical features: this is vectorization.
• The simplest vectorization technique is Bag of Words (BOW). It starts with a list of words called the vocabulary (this is often all the words that occur in the training data)
Features extraction
• To use BOW vectorization in Python, we can rely on CountVectorizer from the scikit-learn library
• scikit-learn has a built-in list of stop words that can be ignored by passing stop_words="english" to the vectorizer
• Moreover, we can pass our custom pre-processing function from earlier to automatically clean the text before it’s vectorized

Training texts: ["This is a good cat", "This is a bad day"] =>
vocabulary: [this, cat, day, is, good, a, bad]
New text: "This day is a good day" --> [1, 0, 2, 1, 1, 1, 0]
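A hand-rolled sketch of this mapping (illustrative; scikit-learn's CountVectorizer behaves the same way, except that it sorts its vocabulary and by default drops one-character tokens such as "a", so the indices below differ from the slide's ordering while the counts are the same):

train_texts = ["This is a good cat", "This is a bad day"]

# Build the vocabulary: every distinct word seen in the training data.
vocabulary = []
for text in train_texts:
    for word in text.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

def vectorize(text):
    # Count how often each vocabulary word occurs in the new text.
    words = text.lower().split()
    return [words.count(v) for v in vocabulary]

print(vocabulary)                           # ['this', 'is', 'a', 'good', 'cat', 'bad', 'day']
print(vectorize("This day is a good day"))  # [1, 1, 1, 1, 0, 0, 2]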
Sentiment Analysis Case Study
• IMDB text data: the IMDB movie reviews dataset is a set of 50,000 reviews, half of which are positive and the other half negative
• We can download the data from:
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
• The extracted data directory is: aclImdb
• We can use a loading function like the sketch below to load the training/test datasets from IMDB
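The loading function on the original slide is an image; a minimal sketch, assuming the archive keeps its standard layout (train/pos, train/neg, test/pos, test/neg, one review per .txt file):

import os

def load_imdb(split_dir):
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(split_dir, label_name)
        for filename in os.listdir(folder):
            with open(os.path.join(folder, filename), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_imdb("aclImdb/train")
test_texts, test_labels = load_imdb("aclImdb/test")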
Sentiment Analysis Case Study
• Feature vectors that result from BOW are usually very large (80,000-dimensional vectors in this case)
• We need to use simple algorithms that are efficient on a large number of features (e.g., Naive Bayes, linear SVM, or logistic regression), as in the baseline sketch below
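A baseline sketch putting the pieces together (assumed setup, reusing the hypothetical clean_text and the load_imdb data from the sketches above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(stop_words="english", preprocessor=clean_text)
X_train = vectorizer.fit_transform(train_texts)  # large, sparse BOW matrix
X_test = vectorizer.transform(test_texts)

model = LogisticRegression(max_iter=1000)  # efficient on sparse, wide data
model.fit(X_train, train_labels)
print(accuracy_score(test_labels, model.predict(X_test)))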
Improving the current model
• Feature extraction is very important (feature engineering)
• There are some biases attached to only looking at how many times a word occurs in a text. In particular, the longer the text, the higher its feature values (word counts) will be
• Use TF-IDF features
• Use n-grams
Improving the model: TF-IDF
• TF-IDF features
• We can train a Linear SVM model with TF-IDF features simply by replacing CountVectorizer with TfidfVectorizer, as sketched below
• The result improves by about 2%
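A sketch of that swap (same assumed setup as the baseline above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(stop_words="english", preprocessor=clean_text)
X_train = vectorizer.fit_transform(train_texts)

model = LinearSVC()  # linear SVM, efficient for sparse high-dimensional data
model.fit(X_train, train_labels)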
Improving the model: n-grams
• Using words independently is not enough. For example: if the word good occurs in a text, we will naturally tend to say that this text is positive, even if the actual expression that occurs is actually not good. Phrases are better cues.
• Use n-grams to handle this
• An N-gram is a sequence of N successive words (e.g., very good [2-gram] and not good at all [4-gram]). Using N-grams, we produce richer word sequences.
• Example with N = 2:
This is a cat. --> [this, is, a, cat, (this, is), (is, a), (a, cat)]
Improving the model: n-grams
• In practice, including N-grams in our TF-IDF vectorizer is as simple as providing an additional parameter ngram_range=(1, N).

vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 2))
Homework
• Sentiment analysis on the standard IMDB dataset (link: http://ai.stanford.edu/~amaas/data/sentiment/)
References
• Chapters 4 and 5, Speech and Language Processing (3rd ed. draft), Jurafsky and Martin
https://web.stanford.edu/~jurafsky/slp3/4.pdf
https://web.stanford.edu/~jurafsky/slp3/5.pdf
