Natural Language Processing CS 1462
Some slides borrowed from Carl Sable
If you are following along in the book…
Information Retrieval
Conventional IR
Vector Space Models
Word Embedding
Basic idea of Embedding
Text Vectors
Research on word embeddings has shown that Moscow is the closest city vector to this value, and the same analogy holds for several other countries and their capitals.
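As a concrete illustration of this vector arithmetic, here is a minimal sketch using gensim's pretrained GloVe vectors; the gensim package, the model name, and the downloaded vectors are assumptions here, not part of the slides:

```python
# A sketch of the analogy arithmetic, using gensim's pretrained GloVe
# vectors. Assumptions: the gensim package is installed and the
# "glove-wiki-gigaword-100" model can be downloaded.
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # downloads on first use

# vector("paris") - vector("france") + vector("russia") should land
# near vector("moscow") if the analogy holds for this model.
print(kv.most_similar(positive=["paris", "russia"],
                      negative=["france"], topn=3))
```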
Measuring Similarity Between Vectors
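The most common similarity measure between term-weight vectors is the cosine of the angle between them; a minimal numpy sketch (the example vectors are illustrative):

```python
# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths, i.e. the cosine of the angle between them.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-weight vectors (illustrative, not from the slides).
doc = np.array([2.0, 1.0, 0.0, 3.0])
query = np.array([1.0, 0.0, 0.0, 2.0])
print(cosine_similarity(doc, query))  # ~0.956; 1.0 = identical direction
```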
Text Feature Extraction
Bag-of-Words (BoW)
TF-IDF (Term Frequency - Inverse Document Frequency)
One-Hot Encoding (see the sketch below)
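A minimal sketch of one-hot encoding, the third method in the list above (the vocabulary is a toy illustrative one):

```python
# One-hot encoding: each word becomes a vector with a single 1 at that
# word's index in the vocabulary and 0 everywhere else.
vocab = ["cat", "dog", "fish"]  # toy vocabulary, illustrative

def one_hot(word: str) -> list[int]:
    return [1 if w == word else 0 for w in vocab]

print(one_hot("dog"))  # [0, 1, 0]
```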
Bag-of-Words (BoW)
Bag-of-Words (BoW) Example
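A minimal bag-of-words sketch over a toy two-document corpus (the corpus and tokenization are illustrative):

```python
# Bag-of-words: represent each document by its word counts, ignoring
# word order entirely. The two-document corpus is illustrative.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
print(vocab)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']

for d in docs:
    counts = Counter(d.split())
    print([counts[w] for w in vocab])
# [1, 0, 1, 1, 1, 2]
# [0, 1, 0, 0, 1, 1]
```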
TF-IDF (Term Frequency - Inverse Document Frequency)
Example with 4 files
TF
TF-IDF - Example
IDF
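A minimal sketch of the TF and IDF computations over a toy collection of 4 files, using the common tf × log(N/df) weighting (textbooks differ on the exact variant; the corpus here is illustrative):

```python
# TF-IDF over a toy collection of 4 files: a term's weight in a
# document is its frequency there (TF), discounted by how many of the
# N documents contain it (IDF). Uses the common tf * log(N / df)
# weighting; textbooks differ on the exact variant.
import math

docs = ["the cat sat", "the dog sat", "the cat ran", "dogs and cats"]
N = len(docs)
tokenized = [d.split() for d in docs]

df = {}  # document frequency: number of docs containing each term
for toks in tokenized:
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def tf_idf(term: str, doc_index: int) -> float:
    tf = tokenized[doc_index].count(term)
    return tf * math.log(N / df[term])

print(tf_idf("cat", 0))  # in 2 of 4 docs -> higher weight (~0.69)
print(tf_idf("the", 0))  # in 3 of 4 docs -> lower weight (~0.29)
```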
Arabic language encoding
Other Issues
As with other NLP applications, we need to decide whether an IR system should apply
stemming (common), apply lemmatization (not common), convert all letters to the same
case, etc.
We need to apply the same text normalization techniques consistently to the queries and to the collection
Many conventional IR systems also use stop lists, containing stop words, which are just
lists of very common words to exclude from the computation
This doesn't change results much, since these words tend to have very low IDF values; it
could make the system more efficient, but makes it more difficult to search for phrases
Another technique used by some conventional IR systems is to add synonyms of query
terms to the queries, to help locate documents that are relevant but don't use overlapping
terms
The vector space model is not the only way to implement a conventional IR system;
another approach (more common for earlier IR systems) was to use a Bayesian model
We will consider a Bayesian approach for text categorization, in addition to an approach
using a vector space model, and also a k-nearest neighbors (KNN) approach
We will later discuss how to evaluate text categorization systems, which uses some of the
same metrics (but text categorization doesn't share all the issues that make evaluation of
IR more difficult)
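A minimal sketch of the stop-list filtering described above (the stop list itself is a tiny illustrative one, not a standard list):

```python
# Stop-list filtering: drop very common words before indexing.
# The stop list here is a tiny illustrative one, not a standard list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the cat sat in the hat".split()))
# ['cat', 'sat', 'hat']
```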
Ranking Web Pages
Text Categorization
Conventional TC Approaches
Text Normalization for TC
Machine Learning for TC
K-Nearest Neighbors
Once all distances are computed, you find the k closest documents from the
training set
The value of k can be manually coded or chosen based on cross-validation
experiments or a tuning set
The categories of these k nearest neighbors will be used to select the category
or categories of the new document
As described, this method is general, not just for text categorization
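A minimal sketch of the procedure above, using cosine similarity as the closeness measure and majority voting among the k neighbors (the vectors, labels, and the choice of cosine are illustrative assumptions):

```python
# KNN for TC: compare a new document's vector against every training
# document, take the k most similar, and let their categories vote.
import numpy as np
from collections import Counter

def knn_predict(query_vec, train_vecs, train_labels, k=3):
    # Cosine similarity of the query against all training documents.
    norms = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = train_vecs @ query_vec / norms
    nearest = np.argsort(sims)[-k:]  # indices of the k most similar
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]  # majority category wins

# Toy data, illustrative.
train_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
train_labels = ["sports", "sports", "politics"]
print(knn_predict(np.array([0.8, 0.2]), train_vecs, train_labels, k=3))
```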
Choosing a Category with KNN
KNN Applied to TC
Naïve Bayes
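The decision rule being discussed: pick the category that is most probable given the document, apply Bayes' rule, and drop the constant denominator (a standard derivation, written out here for reference):

```latex
\hat{c} = \operatorname*{argmax}_{c} P(c \mid d)
        = \operatorname*{argmax}_{c} \frac{P(d \mid c)\, P(c)}{P(d)}
        = \operatorname*{argmax}_{c} P(d \mid c)\, P(c)
```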
The argmax above assumes that categories are mutually exclusive and exhaustive
We eliminated the denominator, P(d), because it is independent of the categories
P(c) is just the prior probability of a category
This can be estimated based on the training set as the frequency of training
documents falling into the category, c
We still must estimate P(d|c)
It is typically assumed that d can be represented as a set of features with values
We start with the "naïve" assumption that the probability distribution for each
feature in a document given its category is not affected by the other features
Note that naïve Bayes, as described so far, is a general approach, not just for TC
Naïve Bayes Applied to TC
Naïve Bayes for text categorization is still considered a bag-of-words approach, but it does not
use a vector space model, and it does not rely on TF*IDF word weights (or any word weights)
The features are the words of the vocabulary, and they are typically considered Boolean features
In other words, all that matters is whether each word does or does not appear in the document
The "naïve" assumption for TC is that the probability of seeing a word in a document given its
specific category is not affected by the other words in the document
Note that this is clearly not true in reality, but Naïve Bayes can still perform well for TC
The standard technique is to estimate, based on the training set, the probability of seeing each
possible term (or word), t, in each possible category, c; we will call this P(t|c)
This leads to: $P(d \mid c) = \prod_{t \in d} P(t \mid c)$
The probability estimates are small, so we use log probabilities instead of probabilities, leading
to the predicted category (assuming categories are mutually exclusive), $\hat{c}$, for a document, d,
being: $\hat{c} = \operatorname{argmax}_{c} \big[ \log P(c) + \sum_{t \in d} \log P(t \mid c) \big]$
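A minimal sketch of this Boolean-feature naive Bayes with add-one smoothing and log probabilities (the smoothing scheme and toy data are illustrative choices, not prescribed by the slides):

```python
# Boolean-feature naive Bayes for TC: P(t|c) is estimated as the
# fraction of category-c training docs containing t, add-one smoothed.
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """docs: list of token sets; labels: parallel list of categories."""
    prior = Counter(labels)          # category -> number of docs
    doc_freq = defaultdict(Counter)  # category -> term -> docs with term
    for d, c in zip(docs, labels):
        doc_freq[c].update(set(d))
    return prior, doc_freq, len(docs)

def predict(doc, prior, doc_freq, n_docs):
    best, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c] / n_docs)  # log P(c)
        for t in set(doc):
            # P(t|c), add-one smoothed so unseen pairs stay nonzero.
            p = (doc_freq[c][t] + 1) / (prior[c] + 2)
            score += math.log(p)  # add log P(t|c)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data, illustrative.
docs = [{"cheap", "pills"}, {"meeting", "agenda"}, {"cheap", "meds"}]
labels = ["spam", "ham", "spam"]
model = train(docs, labels)
print(predict({"cheap", "agenda"}, *model))  # -> 'spam'
```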
Implementation Details of Naïve Bayes for TC
Evaluating TC Systems
Evaluating a TC system is generally easier than for IR, because there is typically a labeled test
set
Overall accuracy is often a reasonable metric for mutually exclusive and exhaustive
categories
However, this is not reliable for binary categories; most documents will generally not belong to
most categories, so a trivial system that says no to everything might appear to do very well
This was true of the most common conventional TC corpus, the Reuters corpus; it includes
9,603 training documents and 3,299 test documents; there are 135 "economic subject
categories"
Therefore, for each category, it is common to measure precision, recall, and a metric called
F1 (or the more general F-measure)
These metrics can be used for categorization in general, not just TC
Consider a confusion matrix (which is a type of contingency table) where rows represent a
system's predictions, and the columns represent the actual categorizations of documents
The next slide shows a generic example of such a confusion matrix for a single binary category
We will define all the metrics mentioned above in relation to such a confusion matrix (formulas
are shown on the next page)
TC Evaluation Metrics (per category)
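A minimal sketch of these per-category metrics, assuming the usual layout where A = true positives (system says yes, actually yes), B = false positives, C = false negatives, and D = true negatives:

```python
# Per-category metrics from a single binary confusion matrix.
def metrics(A: int, B: int, C: int, D: int) -> dict:
    accuracy = (A + D) / (A + B + C + D)
    precision = A / (A + B) if A + B else 0.0
    recall = A / (A + C) if A + C else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(metrics(A=40, B=10, C=20, D=930))  # toy counts, illustrative
```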
TC Evaluation Metrics (per system)
When there are multiple binary categories, it is useful to compute metrics that
evaluate the system as a whole
Two ways to combine the category-specific metrics are micro-averaging and
macro-averaging
To compute the micro-averages for the first three metrics, you combine all the
confusion matrices (for all categories) to obtain global counts for A, B, C, and D
You then apply the formulas from the previous slide to compute the micro-averaged
overall accuracy, precision, and recall
The formula for F1 remains the same and is based on the micro-averaged values of
precision and recall
To compute macro-averages for the first three metrics, you average together the
values of overall accuracy, precision, or recall for the individual categories
The formula for F1 remains the same and is based on the macro-averaged values of
precision and recall
Micro-averaging weighs each decision equally; macro-averaging weighs each
category equally
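A minimal sketch of both averaging schemes, with each category's confusion matrix given as an (A, B, C, D) tuple in the layout used above:

```python
# Micro- vs. macro-averaging over per-category confusion matrices.
def prf(A, B, C, D):
    accuracy = (A + D) / (A + B + C + D)
    precision = A / (A + B) if A + B else 0.0
    recall = A / (A + C) if A + C else 0.0
    return accuracy, precision, recall

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_average(matrices):
    # Combine all matrices into global counts, then apply the
    # per-category formulas once.
    A, B, C, D = (sum(m[i] for m in matrices) for i in range(4))
    acc, p, r = prf(A, B, C, D)
    return acc, p, r, f1(p, r)

def macro_average(matrices):
    # Average the per-category values; F1 is computed from the
    # macro-averaged precision and recall.
    per_cat = [prf(*m) for m in matrices]
    acc, p, r = (sum(x[i] for x in per_cat) / len(per_cat)
                 for i in range(3))
    return acc, p, r, f1(p, r)

cats = [(40, 10, 20, 930), (5, 5, 5, 985)]  # toy counts, illustrative
print(micro_average(cats))
print(macro_average(cats))
```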
Properties of Categories
Properties of Data
Wishing you a fruitful educational experience