Supervised Text Classification Techniques

CT107-3-3-TXSA - Text Analytics and Sentiment Analysis Supervised Text Classification 1

Lab 08: Supervised Text Classification (Part 1)

References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in
Artificial Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.

QUICK REVIEW

Classification is the task of choosing the correct class label for a given input. In basic classification tasks,
each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some
examples of classification tasks are:
• Deciding whether an email is spam or not.
• Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports,"
"technology," and "politics."
• Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial
institution, the act of tilting to the side, or the act of depositing something in a financial institution.
The basic classification task has a number of interesting variants. For example, in multi-label classification
(which the NLTK book calls multi-class classification), each instance may be assigned multiple labels; in
open-class classification, the set of labels is not defined in advance; and in sequence classification, a list
of inputs is jointly classified.
A classifier is called supervised if it is built from training corpora containing the correct label for each
input. The framework used by supervised classification has two phases:

a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets,
which capture the basic information about each input that should be used to classify it, are discussed in
the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a
model.
b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These
feature sets are then fed into the model, which generates predicted labels.
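
The two phases above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the algorithm used later in this lab: the names extract_features, train and predict, the toy messages, and the simple vote-counting "model" are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def extract_features(text):
    """Hypothetical feature extractor: the lower-cased first word of the input."""
    return {"first_word": text.lower().split()[0]}

def train(labeled_inputs):
    """Training phase: count how often each (feature, value) pair occurs with each label."""
    model = defaultdict(Counter)
    for text, label in labeled_inputs:
        for name, value in extract_features(text).items():
            model[(name, value)][label] += 1
    return model

def predict(model, text, default="unknown"):
    """Prediction phase: reuse the same extractor, then pick the label seen most often."""
    votes = Counter()
    for name, value in extract_features(text).items():
        votes.update(model.get((name, value), {}))
    return votes.most_common(1)[0][0] if votes else default

# Toy training corpus (invented for illustration)
training_data = [
    ("win a free prize now", "spam"),
    ("win cash today", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch at noon?", "ham"),
]

model = train(training_data)
print(predict(model, "Win a holiday"))
print(predict(model, "Meeting cancelled"))
```

The key point is that train and predict call the same feature extractor, so unseen inputs are described in exactly the terms the model was built on.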

Level 3 Asia Pacific University (APU) Page 1 of 6

SPELLING CORRECTION

# using Autocorrect (pip install autocorrect)

from autocorrect import Speller
spell = Speller(lang='en')
print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))

# using Textblob
from textblob import TextBlob
b1 = TextBlob("I havv goood speling!")
print(b1.correct())
b2 = TextBlob("caaaar")
print(b2.correct())

# using pip install pyspellchecker

from spellchecker import SpellChecker
spell = SpellChecker()
# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])
for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))

GRAMMAR AND SPELL CHECK USING PYTHON

# pip install language-check

# [Link]
import language_check
tool = language_check.LanguageTool('en-US')
text = "he are a human"
match = tool.check(text)
print(match)
print(len(match))
print(language_check.correct(text, match))

# [Link]
# pip install language-tool-python
import language_tool_python
tool = language_tool_python.LanguageTool('en-US')

text = "he are a human"

matches = tool.check(text)

print(len(matches))
print(tool.correct(text))

PRACTICES

Text/Document Classification (Gender Identification)

Section 2.4 of “Natural Language Processing with Python” (the third reference above) shows that male and
female names have some distinctive characteristics. Names ending in a, e and i are likely to be female,
while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these
differences more precisely.
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode
those features. For this example, we'll start by just looking at the final letter of a given name. The following
feature extractor function builds a dictionary containing relevant information about a given name:

def gender_features(word):
    return {'last_letter': word[-1]}

print(gender_features('Shrek'))

The returned dictionary, known as a feature set, maps from feature names to their values. Feature names are
case-sensitive strings that typically provide a short human-readable description of the feature, as in the
example 'last_letter'. Feature values are values with simple types, such as Booleans, numbers, and
strings.
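
To make the three kinds of feature value concrete, here is a small sketch of an extended extractor. The function name gender_features2 and its extra features (name length, whether the name starts with a vowel) are hypothetical additions for illustration; the lab itself only uses the last letter.

```python
def gender_features2(name):
    """Hypothetical extended extractor with a string, a number, and a Boolean value."""
    return {
        "last_letter": name[-1].lower(),                  # string value
        "length": len(name),                              # numeric value
        "starts_with_vowel": name[0].lower() in "aeiou",  # Boolean value
    }

print(gender_features2("Shrek"))
# {'last_letter': 'k', 'length': 5, 'starts_with_vowel': False}
```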
Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class
labels.

import random
from nltk.corpus import names  # requires the corpus: nltk.download('names')

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a
training set and a test set. The training set is used to train a new "naive Bayes" classifier.

import nltk
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

We will look more closely at the naive Bayes classifier later. For now, let's just test it out on some names
that did not appear in its training data:

print("Neo is a", classifier.classify(gender_features('Neo')))
print("Trinity is a", classifier.classify(gender_features('Trinity')))

Observe that these character names from The Matrix are correctly classified. Although this science fiction
movie is set in 2199, it still conforms to our expectations about names and genders. We can systematically
evaluate the classifier on a much larger quantity of unseen data:

print("\nThe accuracy is equal to: ", nltk.classify.accuracy(classifier, test_set))
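
Accuracy here means the fraction of test items whose predicted label matches the correct label. The sketch below computes that by hand; the function accuracy, the rule-based stand-in last_letter_rule, and the four test items are assumptions made for illustration, not part of NLTK.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for feats, label in labeled_featuresets if classify(feats) == label)
    return correct / len(labeled_featuresets)

def last_letter_rule(features):
    """Hypothetical stand-in classifier: names ending in a, e or i are labeled female."""
    return "female" if features["last_letter"] in "aei" else "male"

# Invented test items: two agree with the rule, two do not
toy_test_set = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "e"}, "male"),
    ({"last_letter": "n"}, "female"),
]

print(accuracy(last_letter_rule, toy_test_set))  # 0.5
```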

Finally, we can examine the classifier to determine which features it found most effective for distinguishing
the names' genders:

classifier.show_most_informative_features(5)

In the reference run, this listing shows that the names in the training set that end in "a" are female 33
times more often than they are male, while names that end in "k" are male 32 times more often than they are
female. Your exact figures will differ slightly because the names are shuffled randomly before the split.
These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome
relationships.
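
A likelihood ratio compares how probable a feature value is under one label with how probable it is under another. The sketch below shows the arithmetic with invented counts and no smoothing; NLTK's own figures come from smoothed probability estimates, so they will not match a raw count ratio exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """P(feature value | label A) divided by P(feature value | label B), unsmoothed."""
    p_given_a = count_a / total_a
    p_given_b = count_b / total_b
    return p_given_a / p_given_b

# Hypothetical counts: 300 of 1000 female names and 10 of 1000 male names end in 'a'
ratio = likelihood_ratio(300, 1000, 10, 1000)
print("female : male = %.1f : 1.0" % ratio)
```

With these made-up counts a name ending in "a" is about 30 times more likely under the female label than under the male label.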
When working with large corpora, constructing a single list that contains the features of every instance can
use up a large amount of memory. In these cases, use the function nltk.classify.apply_features,
which returns an object that acts like a list but does not store all the feature sets in memory:

from nltk.classify import apply_features

train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
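
The idea of "acts like a list but does not store all the feature sets" can be illustrated with a small wrapper that computes each feature set only when it is requested. This is a conceptual sketch, not NLTK's implementation: the class LazyFeatureSets is hypothetical and supports only integer indexing, len() and iteration.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

class LazyFeatureSets:
    """Hypothetical list-like view that builds (features, label) pairs on demand."""

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens  # only the raw (token, label) pairs are stored

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        # An out-of-range index raises IndexError, which also ends a for-loop over the object
        token, label = self._labeled_tokens[index]
        return (self._feature_func(token), label)

lazy = LazyFeatureSets(gender_features, [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazy))
print(lazy[1])
for features, label in lazy:
    print(features, label)
```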

Text/Document Classification (Sentiment Analysis - Movie Review)

There are several examples of corpora where documents have been labelled with categories. Using these
corpora, we can build classifiers that will automatically tag new documents with appropriate category labels.
First, we construct a list of documents, labelled with the appropriate categories. For this example, we've
chosen the Movie Reviews Corpus, which categorizes each review as positive or negative.

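import nltk
import random
from nltk.corpus import movie_reviews  # requires the corpus: nltk.download('movie_reviews')

# Pair each review's word list with its category ('neg' or 'pos')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

print(len(documents))

# Print the category label of every document (the original loop skipped the last one)
for row in range(len(documents)):
    print(documents[row][1])

random.shuffle(documents)
print(documents[1])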

Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it
should pay attention to.

For document topic identification, we can define a feature for each word, indicating whether the document
contains that word. Remove print(documents[1]) and add the following lines to the code:

all_words = []
for w in movie_reviews.words():
    wl = w.lower()
    all_words.append(wl)

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

Remove print(all_words.most_common(15)) and add the following lines to the code:

print(all_words["good"])
print(all_words["excellent"])

To limit the number of features that the classifier needs to process, we begin by constructing a list of the
2000 most frequent words in the overall corpus.

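# most_common() sorts by frequency; list(all_words)[:2000] would not be frequency-ordered
word_features = [w for (w, count) in all_words.most_common(2000)]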

We can then define a feature extractor that simply checks whether each of these words is present in a given
document.

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

You can test the feature extractor as follows:

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
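
If the corpus is not at hand, the same two steps (pick the most frequent words, then record which of them a document contains) can be tried on a toy example. In this sketch collections.Counter stands in for nltk.FreqDist, and the three tiny "reviews" and the cut-off of 2 words are invented for illustration.

```python
from collections import Counter

# Invented mini-corpus of (word list, category) pairs
toy_reviews = [
    (["great", "film"], "pos"),
    (["boring", "film"], "neg"),
    (["great", "cast", "film"], "pos"),
]

# Count every word, then keep the 2 most frequent as features
counts = Counter(w.lower() for words, _ in toy_reviews for w in words)
word_features = [w for w, _ in counts.most_common(2)]

def find_features(document):
    """Map each selected word to True/False depending on whether the document contains it."""
    words = set(document)
    return {w: (w in words) for w in word_features}

print(word_features)                      # ['film', 'great']
print(find_features(["boring", "film"]))  # {'film': True, 'great': False}
```

Words outside the selected vocabulary, such as "boring" here, are simply ignored by the extractor, which is why the choice of the most frequent words matters.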

Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews.

featuresets = [(find_features(rev), category) for (rev, category) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(5)

To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we
can use show_most_informative_features() to find out which features the classifier found to be
most informative.

print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

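In both situations, other algorithms can be used in place of naive Bayes. NLTK provides its own Decision Tree
(nltk.DecisionTreeClassifier) and Maximum Entropy (nltk.MaxentClassifier) classifiers, and its
SklearnClassifier wrapper (nltk.classify.scikitlearn.SklearnClassifier) lets Sklearn models such as a
Support Vector Machine (SVC) be trained on the same feature sets.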

For this kind of classification problem, we can also use one of the following classifiers from the Sklearn
package. Preparing and pre-processing the dataset is crucial so that the data has a shape suitable for the
chosen algorithm.
• KNeighborsClassifier(3)
• SVC(kernel="linear", C=0.025)
• SVC(gamma=2, C=1)
• GaussianProcessClassifier(1.0 * RBF(1.0))
• DecisionTreeClassifier(max_depth=5)
• RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
• MLPClassifier(alpha=1, max_iter=1000)
• AdaBoostClassifier()
• GaussianNB()
• LogisticRegression()
• SGDClassifier()

Refer:
[Link]
download-auto-examples-classification-plot-classifier-comparison-py
