Lab 08 - Supervised Text Classification-Part 1
QUICK REVIEW
Classification is the task of choosing the correct class label for a given input. In basic classification tasks,
each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some
examples of classification tasks are:
- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports,"
  "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial
  institution, the act of tilting to the side, or the act of depositing something in a financial institution.
The basic classification task has a number of interesting variants. For example, in multi-label classification
(sometimes loosely called multi-class classification), each instance may be assigned multiple labels; in
open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of
inputs is jointly classified.
A classifier is called supervised if it is built based on training corpora containing the correct label for each
input. The framework used by supervised classification has two phases:
a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets,
which capture the basic information about each input that should be used to classify it, are discussed in
the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a
model.
b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These
feature sets are then fed into the model, which generates predicted labels.
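The two phases above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK implementation: the spam/ham messages, the two features, and the majority-vote "learner" are all assumptions chosen to keep the example small.

```python
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extractor: turn one raw input into a feature set (a dict)."""
    words = text.lower().split()
    return {"has_free": "free" in words, "has_exclamation": "!" in text}

def train(labeled_texts):
    """Toy learner: remember the majority label for each distinct feature set."""
    votes = defaultdict(Counter)
    for text, label in labeled_texts:
        key = tuple(sorted(extract_features(text).items()))
        votes[key][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

def predict(model, text, default="ham"):
    """Prediction applies the same feature extractor, then consults the model."""
    key = tuple(sorted(extract_features(text).items()))
    return model.get(key, default)

# Hypothetical training data (not part of the lab's corpora).
training_data = [
    ("Win a free prize now!", "spam"),
    ("Free entry, click here!", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("Lunch tomorrow?", "ham"),
]
model = train(training_data)
print(predict(model, "Claim your free gift!"))   # spam
print(predict(model, "See you at the meeting"))  # ham
```

The point to notice is that train and predict both call extract_features, so unseen inputs are described in exactly the same terms as the training inputs.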
SPELLING CORRECTION
# using Textblob
from textblob import TextBlob
b1 = TextBlob("I havv goood speling!")
print(b1.correct())
b2 = TextBlob("caaaar")
print(b2.correct())
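# Using language-tool-python
# https://pypi.org/project/language-tool-python/
# pip install language-tool-python
import language_tool_python
tool = language_tool_python.LanguageTool('en-US')
text = "I havv goood speling!"  # sample input, reused from the TextBlob example above
matches = tool.check(text)      # list of detected spelling and grammar issues
print(len(matches))
print(tool.correct(text))       # text with the suggested corrections applied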
PRACTICES
def gender_features(word):
    # Use the final letter of the name as the only feature.
    return {'last_letter': word[-1]}

print(gender_features('Shrek'))  # {'last_letter': 'k'}
The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are
case-sensitive strings that typically provide a short human-readable description of the feature, as in the
example 'last_letter'. Feature values are values with simple types, such as Booleans, numbers, and
strings.
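To make the three value types concrete, here is a small sketch of a feature extractor that returns one feature of each type. The function name gender_features_extended and the two extra features are hypothetical, added only for illustration.

```python
def gender_features_extended(word):
    """Hypothetical extractor returning string, numeric, and Boolean values."""
    lowered = word.lower()
    return {
        "last_letter": lowered[-1],                 # string value
        "length": len(word),                        # numeric value
        "ends_with_vowel": lowered[-1] in "aeiou",  # Boolean value
    }

print(gender_features_extended("Shrek"))
# {'last_letter': 'k', 'length': 5, 'ends_with_vowel': False}
```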
Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class
labels.
Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a
training set and a test set. The training set is used to train a new "naive Bayes" classifier.
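import random
import nltk
from nltk.corpus import names  # requires nltk.download('names')
# Build shuffled (name, gender) pairs from the Names corpus.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]  # hold out the first 500 for testing
classifier = nltk.NaiveBayesClassifier.train(train_set)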
We will learn more about the naive Bayes classifier later. For now, let's just test it out on some names that did
not appear in its training data:
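With an NLTK classifier, a single name is labelled by calling classifier.classify(gender_features(name)). The sketch below is a standard-library stand-in that shows roughly what a naive Bayes classifier does with the last-letter feature. The eight training names and the add-one smoothing are assumptions; NLTK's own estimator differs in detail.

```python
import math
from collections import Counter, defaultdict

def gender_features(word):
    return {"last_letter": word[-1]}

def train_naive_bayes(labeled_featuresets):
    """Count labels and, per label, how often each feature value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values_seen = defaultdict(set)       # feature name -> values seen in training
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen

def classify(model, features):
    """Return the label maximising log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            # Add-one smoothing keeps unseen values from giving probability zero.
            numerator = value_counts[(label, name)][value] + 1
            denominator = count + len(values_seen[name])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training names (the lab itself trains on a full names list).
toy_names = [("Mario", "male"), ("Leo", "male"), ("Jack", "male"), ("Mark", "male"),
             ("Mary", "female"), ("Lucy", "female"), ("Anna", "female"), ("Emma", "female")]
model = train_naive_bayes([(gender_features(n), g) for n, g in toy_names])
print(classify(model, gender_features("Neo")))      # male
print(classify(model, gender_features("Trinity")))  # female
```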
Observe that these character names from The Matrix are correctly classified. Although this science fiction
movie is set in 2199, it still conforms to our expectations about names and genders. We can systematically
evaluate the classifier on a much larger quantity of unseen data:
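NLTK provides nltk.classify.accuracy for this step. What that measure means can be sketched in a few lines: accuracy is the fraction of held-out examples whose predicted label equals the correct label. The classifier rule and the four test items below are hypothetical stand-ins.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of feature sets whose predicted label matches the correct label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical stand-in classifier: names ending in a vowel are labelled "female".
def toy_classify(features):
    return "female" if features["last_letter"] in "aeiou" else "male"

toy_test_set = [({"last_letter": "a"}, "female"),
                ({"last_letter": "k"}, "male"),
                ({"last_letter": "e"}, "male"),   # misclassified by the toy rule
                ({"last_letter": "n"}, "male")]
print(accuracy(toy_classify, toy_test_set))  # 0.75
```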
Finally, we can examine the classifier to determine which features it found most effective for distinguishing
the names' genders:
classifier.show_most_informative_features(5)
This listing shows that the names in the training set that end in "a" are female 33 times more often than they
are male, but names that end in "k" are male 32 times more often than they are female. These ratios are
known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
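As a worked example, a likelihood ratio compares how probable a feature value is under one label with how probable it is under the other. The counts below are invented to give a round number. NLTK also smooths its probability estimates, so the ratios it prints will not exactly match ratios computed from raw counts like this.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Return P(feature | label A) / P(feature | label B) from raw counts."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: 330 of 1000 female names end in "a",
# while only 10 of 1000 male names do.
ratio = likelihood_ratio(330, 1000, 10, 1000)
print(round(ratio, 1))  # 33.0
```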
When working with large corpora, constructing a single list that contains the features of every instance can
use up a large amount of memory. In these cases, use the function nltk.classify.apply_features,
which returns an object that acts like a list but does not store all the feature sets in memory:
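In NLTK the call has the form apply_features(gender_features, labeled_names[500:]), with apply_features imported from nltk.classify. The idea behind it can be sketched with a small list-like class that computes each feature set only when it is requested. The class name LazyFeatureSets and the three names are hypothetical; this is not NLTK's implementation.

```python
class LazyFeatureSets:
    """List-like view that computes each (features, label) pair on demand."""

    def __init__(self, feature_func, labeled_inputs):
        self._feature_func = feature_func
        self._labeled_inputs = labeled_inputs

    def __len__(self):
        return len(self._labeled_inputs)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view rather than a materialised list.
            return LazyFeatureSets(self._feature_func, self._labeled_inputs[index])
        raw_input, label = self._labeled_inputs[index]
        return (self._feature_func(raw_input), label)

def gender_features(word):
    return {"last_letter": word[-1]}

toy_labeled_names = [("Anna", "female"), ("Mark", "male"), ("Emma", "female")]  # hypothetical
lazy_sets = LazyFeatureSets(gender_features, toy_labeled_names)
print(len(lazy_sets))  # 3
print(lazy_sets[1])    # ({'last_letter': 'k'}, 'male')
```

Only the raw (name, label) pairs are stored; each feature dictionary exists just long enough to be used.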
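# Document classification with the Movie Reviews corpus.
import nltk
import random
from nltk.corpus import movie_reviews  # requires nltk.download('movie_reviews')
# Pair each review's word list with its category ('pos' or 'neg').
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
print(len(documents))
random.shuffle(documents)
print(documents[1])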
Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it
should pay attention to.
For document topic identification, we can define a feature for each word, indicating whether the document
contains that word. Remove print(documents[1]) and add the following lines to the code:
all_words = []
for w in movie_reviews.words():
    wl = w.lower()  # normalise case so "Good" and "good" are counted together
    all_words.append(wl)
all_words = nltk.FreqDist(all_words)  # maps each word to its frequency
print(all_words.most_common(15))
print(all_words["good"])
print(all_words["excellent"])
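nltk.FreqDist builds on Python's collections.Counter, so the counting step can be tried without downloading the corpus. The token list below is a hypothetical stand-in for movie_reviews.words().

```python
from collections import Counter

# Hypothetical mini-corpus standing in for movie_reviews.words().
tokens = ["Good", "film", ",", "good", "cast", ",", "excellent", "score", ","]
word_counts = Counter(w.lower() for w in tokens)
print(word_counts.most_common(2))  # [(',', 3), ('good', 2)]
print(word_counts["good"])         # 2
print(word_counts["terrible"])     # 0 (missing words count as zero)
```

Punctuation tokens dominate the top of the list here, as they tend to in tokenised text, which is worth keeping in mind when the most frequent words are used as features.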
To limit the number of features that the classifier needs to process, we begin by constructing a list of the
2000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks
whether each of these words is present in a given document.
word_features = [w for (w, _) in all_words.most_common(2000)]  # the 2000 most frequent words
def find_features(document):
    words = set(document)  # set membership tests are fast
    features = {}
    for w in word_features:
        features[w] = (w in words)  # True if the word occurs in the document
    return features

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews.
To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we
can use show_most_informative_features() to find out which features the classifier found to be
most informative.
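featuresets = [(find_features(d), c) for (d, c) in documents]
classifier = nltk.NaiveBayesClassifier.train(featuresets[100:])  # train on all but the first 100 reviews
print(nltk.classify.accuracy(classifier, featuresets[:100]))     # test set: the first 100 reviews
classifier.show_most_informative_features(5)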
In both cases (name gender and movie reviews), other algorithms can be used through the NLTK package, such
as decision trees (nltk.DecisionTreeClassifier), maximum entropy (nltk.MaxentClassifier), and scikit-learn
estimators such as support vector machines (SVC) via NLTK's SklearnClassifier wrapper.
Choosing the dataset and pre-processing it are crucial steps: the data must be given a shape suitable for the
chosen algorithm. For this kind of classification problem, we can also use one of the following classifiers
from the scikit-learn package directly:
KNeighborsClassifier(3)
SVC(kernel="linear", C=0.025)
SVC(gamma=2, C=1)
GaussianProcessClassifier(1.0 * RBF(1.0))
DecisionTreeClassifier(max_depth=5)
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
MLPClassifier(alpha=1, max_iter=1000)
AdaBoostClassifier()
GaussianNB()
LogisticRegression()
SGDClassifier()
Reference:
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-download-auto-examples-classification-plot-classifier-comparison-py
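The scikit-learn classifiers listed above expect numeric arrays rather than feature dictionaries, which is one reason pre-processing matters. scikit-learn's DictVectorizer performs this kind of conversion; the sketch below shows the underlying idea (one-hot encoding) with the standard library only. The function name vectorize and the sample feature sets are hypothetical.

```python
def vectorize(feature_dicts):
    """One-hot encode feature dicts into equal-length rows of 0s and 1s."""
    # One column per distinct "name=value" pair seen in the data.
    columns = sorted({f"{name}={value}" for features in feature_dicts
                      for name, value in features.items()})
    index = {column: i for i, column in enumerate(columns)}
    rows = []
    for features in feature_dicts:
        row = [0] * len(columns)
        for name, value in features.items():
            row[index[f"{name}={value}"]] = 1
        rows.append(row)
    return columns, rows

feature_dicts = [{"last_letter": "a"}, {"last_letter": "k"}, {"last_letter": "a"}]
columns, rows = vectorize(feature_dicts)
print(columns)  # ['last_letter=a', 'last_letter=k']
print(rows)     # [[1, 0], [0, 1], [1, 0]]
```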