
CT107-3-3-TXSA - Text Analytics and Sentiment Analysis

Lab 08: Supervised Text Classification (Part 1)

References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in
Artificial Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.

QUICK REVIEW

Classification is the task of choosing the correct class label for a given input. In basic classification tasks,
each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some
examples of classification tasks are:
• Deciding whether an email is spam or not.
• Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
• Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.
The basic classification task has a number of interesting variants. For example, in multi-label classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.
A classifier is called supervised if it is built from training corpora containing the correct label for each input. The framework used by supervised classification is described below:

a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets,
which capture the basic information about each input that should be used to classify it, are discussed in
the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a
model.
b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These
feature sets are then fed into the model, which generates predicted labels.
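
As a minimal sketch of this two-phase workflow (using a toy, purely illustrative spam example; the feature extractor and the two training sentences here are made up for demonstration):

import nltk

# (a) training: the feature extractor converts each input into a feature set,
#     and (feature set, label) pairs are fed to the learning algorithm
def extract_features(text):
    return {'contains_free': 'free' in text.lower()}

train_data = [(extract_features("Win a FREE prize now"), 'spam'),
              (extract_features("Meeting moved to 3pm"), 'ham')]
model = nltk.NaiveBayesClassifier.train(train_data)

# (b) prediction: the same extractor is applied to an unseen input,
#     and the model generates a predicted label
print(model.classify(extract_features("Claim your free gift")))  # 'spam'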


SPELLING CORRECTION

# using Autocorrect (pip install autocorrect)


from autocorrect import Speller

spell = Speller(lang='en')
print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))

# using Textblob
from textblob import TextBlob
b1 = TextBlob("I havv goood speling!")
print(b1.correct())
b2 = TextBlob("caaaar")
print(b2.correct())

# using pyspellchecker (pip install pyspellchecker)


from spellchecker import SpellChecker

spell = SpellChecker()
# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])
for word in misspelled:
    # get the one `most likely` answer
    print(spell.correction(word))
    # get a list of `likely` options
    print(spell.candidates(word))

GRAMMAR AND SPELL CHECK USING PYTHON

# pip install language-check
# https://pypi.org/project/language-check/
import language_check

tool = language_check.LanguageTool('en-US')
text = "he are a human"
matches = tool.check(text)
print(matches)
print(len(matches))
print(language_check.correct(text, matches))

# https://pypi.org/project/language-tool-python/
# pip install language-tool-python
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')
text = "he are a human"
matches = tool.check(text)


print(len(matches))
print(tool.correct(text))

PRACTICES

Text/Document Classification (Gender Identification)


Section 2.4 of "Natural Language Processing with Python" (the third reference above) shows that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode
those features. For this example, we'll start by just looking at the final letter of a given name. The following
feature extractor function builds a dictionary containing relevant information about a given name:

def gender_features(word):
    return {'last_letter': word[-1]}

print(gender_features('Shrek'))
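
For 'Shrek', this prints the feature set {'last_letter': 'k'}.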

The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature, as in the example 'last_letter'. Feature values are values of simple types, such as Booleans, numbers, and strings.
Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class
labels.

from nltk.corpus import names
import random

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
# shuffle so the later train/test split is not biased by the corpus ordering
random.shuffle(labeled_names)

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a
training set and a test set. The training set is used to train a new "naive Bayes" classifier.

import nltk

featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

We will learn more about the naive Bayes classifier later. For now, let's just test it out on some names that did not appear in its training data:

print("Neo is a", classifier.classify(gender_features('Neo')))


print("Trinity is a",
classifier.classify(gender_features('Trinity')))


Observe that these character names from The Matrix are correctly classified. Although this science fiction
movie is set in 2199, it still conforms to our expectations about names and genders. We can systematically
evaluate the classifier on a much larger quantity of unseen data:

print("\nThe accuracy is equal to: ",


nltk.classify.accuracy(classifier, test_set))

Finally, we can examine the classifier to determine which features it found most effective for distinguishing
the names' genders:

classifier.show_most_informative_features(5)
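
The output will resemble the following (the exact ratios vary from run to run because of the random shuffle; the two lines shown here correspond to the figures discussed below):

Most Informative Features
             last_letter = 'a'            female : male   =     33.2 : 1.0
             last_letter = 'k'              male : female =     32.3 : 1.0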

This listing shows that the names in the training set that end in "a" are female 33 times more often than they
are male, but names that end in "k" are male 32 times more often than they are female. These ratios are
known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
When working with large corpora, constructing a single list that contains the features of every instance can
use up a large amount of memory. In these cases, use the function nltk.classify.apply_features,
which returns an object that acts like a list but does not store all the feature sets in memory:

from nltk.classify import apply_features

train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

Text/Document Classification (Sentiment Analysis - Movie Review)


There are several examples of corpora where documents have been labelled with categories. Using these
corpora, we can build classifiers that will automatically tag new documents with appropriate category labels.
First, we construct a list of documents, labelled with the appropriate categories. For this example, we've
chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

print(len(documents))

# print the label of every document (all 'neg' first, then all 'pos')
for row in range(len(documents)):
    print(documents[row][1])

# shuffle so the labels are interleaved before splitting into train/test
random.shuffle(documents)
print(documents[1])

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it
should pay attention to.


For document topic identification, we can define a feature for each word, indicating whether the document
contains that word. Remove print(documents[1]) and add the following lines to the code:

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

Remove print(all_words.most_common(15)) and add the following lines to the code:

print(all_words["good"])
print(all_words["excellent"])

To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus.

# most_common() guarantees frequency order; plain list(all_words) would not
word_features = [w for (w, count) in all_words.most_common(2000)]

We can then define a feature extractor that simply checks whether each of these words is present in a given
document.

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

You can test the feature extractor by running:

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews.

featuresets = [(find_features(rev), category) for (rev, category) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(5)

To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we
can use show_most_informative_features() to find out which features the classifier found to be
most informative.


print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

In both situations, we can also train other algorithms available through the NLTK package, such as a decision tree (nltk.DecisionTreeClassifier), a maximum entropy (MaxEnt) model (nltk.MaxentClassifier), or any scikit-learn estimator, such as a support vector machine (SVC), via the SklearnClassifier wrapper. A minimal sketch follows.
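
This sketch reuses the train_set and test_set built in the examples above; the max_iter cap on MaxEnt and the linear kernel are illustrative choices, and SklearnClassifier requires scikit-learn to be installed:

import nltk
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC

# NLTK's own decision tree and maximum entropy (MaxEnt) learners
# (the decision tree can be slow on large feature sets)
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
me_classifier = nltk.MaxentClassifier.train(train_set, max_iter=10)

# SklearnClassifier wraps any scikit-learn estimator behind the NLTK API
svc_classifier = SklearnClassifier(SVC(kernel='linear')).train(train_set)
print(nltk.classify.accuracy(svc_classifier, test_set))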

For this kind of classification problem, we can also use one of the following classifiers from the sklearn package directly. Dataset preparation and pre-processing are crucial to get the data into a shape suitable for the chosen algorithm; a sketch of one such pipeline follows the list.
• KNeighborsClassifier(3)
• SVC(kernel="linear", C=0.025)
• SVC(gamma=2, C=1)
• GaussianProcessClassifier(1.0 * RBF(1.0))
• DecisionTreeClassifier(max_depth=5)
• RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
• MLPClassifier(alpha=1, max_iter=1000)
• AdaBoostClassifier()
• GaussianNB()
• LogisticRegression()
• SGDClassifier()
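
As a sketch of this route (reusing the featuresets list of (feature dictionary, label) pairs built for the movie reviews above; DictVectorizer converts the dictionaries into the numeric matrix scikit-learn expects, and LogisticRegression stands in for any classifier from the list):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# split the (feature_dict, label) pairs into inputs and labels
X = [features for (features, label) in featuresets]
y = [label for (features, label) in featuresets]

# DictVectorizer turns feature dictionaries into a sparse numeric matrix
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X[100:], y[100:])           # train on the same split as before
print(model.score(X[:100], y[:100]))  # accuracy on the held-out reviews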

Refer:
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-download-auto-examples-classification-plot-classifier-comparison-py

