Lecture 6 Text Classification
Text Classification
• Text classification is an example of a supervised machine learning task,
since a labelled dataset containing text documents and their labels is
used to train a classifier.
• Quite often, we find ourselves with a set of text data that we’d like to classify
according to some criterion (the subject of each snippet, for example),
and text classification is what helps us do this.
• The diagram below illustrates the big-picture view of what we want to do when
classifying text. First, we extract the features we want from our source text (and
any tags or metadata it came with), and then we feed our cleaned data into a
machine learning algorithm that does the classification for us.
Text and Tag
• Sentiment Analysis
•'Awful experience. I would never buy this product again!' → Very Negative
•'I don't think there is anything I really dislike about the product' → Neutral
•'The older interface was much simpler' → Negative
Text and Tag
• Email classification
Text cleaning/Text preprocessing
• Word Tokenization
• Change all the text to lower case
• Punctuation removal
• Stop words removal
• Lexical normalization – stemming, Lemmatization
• Remove Blank rows in Data, if any
• Remove Non-alpha text
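The cleaning steps above can be sketched in plain Python. This is an illustrative sketch only: the tiny hard-coded stop-word list is a stand-in (in practice a library such as NLTK or spaCy supplies stop words, stemming, and lemmatization).

```python
import string

# Simplified stand-in for a real stop-word list (NLTK/spaCy provide full lists)
STOP_WORDS = {"a", "an", "the", "is", "and", "i", "my", "to", "of"}

def preprocess(text):
    text = text.lower()                                               # lower-case
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()                                             # word tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]               # stop-word removal
    tokens = [t for t in tokens if t.isalpha()]                       # remove non-alpha tokens
    return tokens

print(preprocess("Awful experience. I would never buy this product again!"))
# ['awful', 'experience', 'would', 'never', 'buy', 'this', 'product', 'again']
```

Stemming and lemmatization are omitted here because they need a linguistic resource; they would be applied to the token list as a final step.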
Train and Test Set
• Split the data into a training set and a test set.
• The model is fitted on the training set, and predictions are
performed on the test set.
• We train the model on data that we call the training data or training set.
Each example in the training data already has its actual label.
• During training, the algorithm compares the model’s predictions against the
labels it should have predicted, and adjusts the parameter values to account
for the data in the training set.
• But how do we know the model is good overall after training?
For that, we have the test data/test set: a separate portion of the data whose
labels we also know, but which was never shown to the model during training.
• If the trained model performs well on the test set as well, we can
say that the machine learning model is good.
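A minimal sketch of an 80/20 split, written with the standard library (scikit-learn's `train_test_split` does the same job, with extra options such as stratification):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle the data, then cut off the last test_ratio fraction as the test set."""
    data = list(data)
    random.Random(seed).shuffle(data)      # shuffle so the split is random but reproducible
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]          # (training set, test set)

# Hypothetical labelled documents for illustration
docs = [("doc%d" % i, "label") for i in range(10)]
train, test = train_test_split(docs)
print(len(train), len(test))  # 8 2
```

The fixed seed makes the split reproducible, which matters when comparing classifiers on the same data.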
Feature extraction
• Feature Engineering: the next step is feature engineering, in which the raw
dataset is transformed into flat features that can be used in a machine learning
model. This step also includes creating new features from the
existing data.
• Bag-of-words: Count Vectors as features
• TF-IDF Vectors as features
• Word level - Matrix representing tf-idf scores of every term in different documents
• N-Gram level - N-grams are combinations of N consecutive terms. Matrix representing tf-idf scores of N-grams
• Character level - Matrix representing tf-idf scores of character level n-grams in the corpus
• Word Embeddings as features
• Text / NLP based features
• Topic Models as features
Document term matrix
• Doc 1: I love dogs.
• Doc 2: I hate dogs and knitting.
• Doc 3: Knitting is my hobby and my passion.
• Create a matrix of documents and words by counting the occurrences of
each word in the given documents.
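Building the document-term matrix for the three example documents by hand makes the idea concrete (scikit-learn's `CountVectorizer` automates this, with its own tokenization rules):

```python
from collections import Counter

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and my passion.",
]

# Crude tokenization: lower-case, strip the trailing period, split on spaces
tokenized = [d.lower().replace(".", "").split() for d in docs]

# One column per distinct word, in alphabetical order
vocab = sorted(set(w for doc in tokenized for w in doc))

# One row per document; each cell counts that word's occurrences in the document
matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
# Note that "my" occurs twice in Doc 3, so its cell holds 2.
```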
TF-IDF
• Term Frequency (TF): the number of times a word occurs in a document,
divided by the total number of words in that document.
• Inverse Document Frequency (IDF): measures how much information a given
word provides, based on how rare it is across the documents in the corpus.
• A word with a high tf-idf score in a document occurs frequently in that
document and rarely (or not at all) in the other documents, so it can
serve as a signature word for that document.
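The two definitions above can be computed directly. This sketch uses the plain formulation tf = count/len and idf = log(N/df); real libraries such as scikit-learn use a smoothed idf variant, so the exact numbers differ:

```python
import math

def tf_idf(tokenized_docs):
    n = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in tokenized_docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    # tf-idf score per word per document
    scores = []
    for doc in tokenized_docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

docs = [["i", "love", "dogs"],
        ["i", "hate", "dogs", "and", "knitting"],
        ["knitting", "is", "my", "hobby", "and", "my", "passion"]]
scores = tf_idf(docs)

# "love" appears only in doc 0, so it gets a high score there;
# "i" appears in two of the three docs, so its score is lower.
print(scores[0])
```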
Machine Learning Algorithm/model
• Model Training: The next step is the Model Building step in which a
machine learning model is trained on a labelled dataset
• Naïve bayes classifier
• Logistic regression
• Support vector machine
• Artificial Neural Networks
• …
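To show what "training on a labelled dataset" means concretely, here is a minimal multinomial Naive Bayes classifier built from scratch with Laplace smoothing. The toy documents and labels are invented for illustration; in practice scikit-learn's `MultinomialNB` would be used:

```python
import math
from collections import Counter

class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = set(labels)
        # Log prior: how common each class is in the training data
        self.priors = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        # Per-class word counts
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = set(w for c in self.classes for w in self.word_counts[c])
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        best, best_score = None, -math.inf
        for c in self.classes:
            score = self.priors[c]
            for w in doc:
                # Laplace (add-one) smoothing avoids zero probabilities for unseen words
                p = (self.word_counts[c][w] + 1) / (self.totals[c] + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best

# Hypothetical tokenized training documents with sentiment labels
docs = [["awful", "experience", "never", "again"],
        ["great", "product", "love", "it"],
        ["terrible", "never", "buying", "again"],
        ["love", "this", "great", "value"]]
labels = ["neg", "pos", "neg", "pos"]

clf = NaiveBayes().fit(docs, labels)
print(clf.predict(["never", "again", "awful"]))  # neg
```

Working in log space keeps the products of many small probabilities from underflowing to zero.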
Evaluation
• The training dataset trains the model to predict the unknown labels
of unseen data.
• There are multiple algorithms, namely, Logistic regression, K-nearest
neighbor, Decision tree, Naive Bayes etc. All these algorithms have
their own style of execution and different techniques of prediction.
• But, at the end, we need to find the effectiveness of an algorithm.
• To find the most suitable algorithm for a particular business problem,
there are a few model evaluation techniques.
Evaluation
• For a classification task, positive means that an instance is labelled as
belonging to the class of interest: we may want to automatically gather all
news articles about Microsoft out of a news feed, identify fraudulent
credit card transactions, classify emails as spam or not, etc.
• A false positive is concluding that something is positive when it is not. False
positives are sometimes called Type I errors.
• A false negative is concluding that something is negative when it is not.
False negatives are sometimes called Type II errors.
• True negative is concluding that something is negative when it is actually
negative
• True positive is concluding that something is positive when it is actually
positive
Evaluation
• Precision and recall
• Precision is the percentage of your results that are relevant.
• Recall is the percentage of the total relevant results correctly classified by
your algorithm.
• F-score
• the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall)
• near one when both precision and recall are high, and near zero when they
are both low.
• It is a convenient single score to characterize overall accuracy, especially for
comparing the performance of different classifiers.
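The three metrics are simple ratios over the confusion-matrix counts. The functions below mirror the definitions above (equivalent to scikit-learn's `precision_score`, `recall_score`, and `f1_score`); the demo counts are hypothetical:

```python
def precision(tp, fp):
    # Of everything predicted positive, how much was truly positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything truly positive, how much did we find?
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Hypothetical counts for illustration
p, r = precision(tp=90, fp=10), recall(tp=90, fn=30)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.9 0.75 0.82
```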
Example
• Classify email as either spam or not spam
Classify email as either spam or not spam
                          Actual
                     Spam    Not spam
System      Spam      12         8
(Predicted) Not spam   3        77
• True Positive: 12 (You have predicted the positive case correctly!) The system
predicted spam and the emails truly are spam.
• True Negative: 77 (You have predicted the negative case correctly!) The system
predicted not spam and the emails truly are not spam.
• False Positive: 8 (Oh! You have predicted these emails are spam, but they are
actually not spam. This is a type-I error in this case.)
• False Negative: 3 (Oh no! You have predicted that these three emails are not spam,
but they actually are spam. This is dangerous! Be careful! This is a type-II error in this case.)
Classify email as either spam or not spam
• Accuracy - the ratio of correctly predicted emails to the total number of
emails: (12+77)/100 = 0.89.
• Precision - the ratio 12/(12+8) = 0.6: of all the emails the system flagged as
spam, the fraction that truly are spam.
• Recall - the ratio 12/(12+3) = 0.8: of all the emails that truly are spam, the
fraction the system managed to detect.
                          Actual
                     Spam    Not spam
System      Spam      12         8
(Predicted) Not spam   3        77
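Plugging the confusion-matrix counts from the spam example into the three metric definitions reproduces the values computed above:

```python
# Counts from the spam/not-spam confusion matrix above
tp, fp, fn, tn = 12, 8, 3, 77

accuracy = (tp + tn) / (tp + fp + fn + tn)   # (12+77)/100
precision = tp / (tp + fp)                   # 12/20
recall = tp / (tp + fn)                      # 12/15

print(accuracy, precision, recall)  # 0.89 0.6 0.8
```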