ML7 - Text Classification

Text classification is a widely used natural language processing task to automatically categorize text documents into predefined categories. The key steps involve preparing text data, engineering features from the text like counts, TF-IDF, word embeddings, training classifiers like Naive Bayes, SVM, Random Forest on the features, and evaluating the model performance. An example is classifying Amazon reviews as positive or negative sentiment based on the text.


Text Classification

REFERENCES

A Comprehensive Guide to Understand and Implement Text Classification in Python (analyticsvidhya.com)

Natural Language Processing | NLP in Python | NLP Libraries (analyticsvidhya.com)
Text Classification
• Widely used Natural Language Processing (NLP) task
• Goal: Automatically classify text documents into one or more predefined
categories (a supervised ML task)
• Examples of Text Classification
• Understanding audience sentiment from social media
• Detection of spam and non-spam emails
• Auto tagging of customer queries
• Categorization of news articles into defined topics
Text Classification Pipeline
• Dataset Preparation
• Loading dataset and performing basic pre-processing
• Dataset is then split into training and test/validation sets
• Feature Engineering
• Raw dataset transformed into flat features which can be used in an ML model
• Plus the process of creating new features from the existing data
• Model Training
• A ML model is trained on a labelled dataset

• Our Example Problem Statement


• To detect sentiment in Amazon Reviews
• Classify Reviews (‘0’: negative or ‘1’: positive)
Data Preparation
• Dataset
• Amazon Reviews (10,000 in number)
• Filename: ‘corpus’
• Text (variable length review) & Label (Label_1: Negative review; Label_2:
Positive review)
• Load dataset & create dataframe using texts & labels
• Split the dataset into training & validation sets & encode the target
variable
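The preparation steps above can be sketched with pandas and scikit-learn. The inline reviews and the "Label_x <text>" line format are stand-ins for the actual 'corpus' file, whose exact layout is assumed here:

```python
import pandas as pd
from sklearn import model_selection, preprocessing

# Assumed line format "Label_x <review text>"; a handful of inline
# reviews stand in for the 10,000-review 'corpus' file
lines = ["Label_1 This product broke after two days.",
         "Label_2 Absolutely loved it, works perfectly.",
         "Label_1 Waste of money, very disappointed.",
         "Label_2 Great quality and fast shipping."]
labels = [line.split(" ", 1)[0] for line in lines]
texts = [line.split(" ", 1)[1] for line in lines]

# Create a dataframe using texts & labels
df = pd.DataFrame({"text": texts, "label": labels})

# Split the dataset into training & validation sets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42)

# Encode the target variable (Label_1 -> 0: negative, Label_2 -> 1: positive)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)
print(len(train_x), len(valid_x))
```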
Feature Engineering
• Raw text data transformed into feature vectors and new features
created using the existing dataset
• Different Methods / Ways of extracting vectors from text
• Count Vectors as features
• TF-IDF Vectors as features
• Word Embeddings as features
Feature Engineering
• Count Vectors as features (Bag-of-Words)
• Matrix notation of the dataset in which
• every row represents a document from the corpus
• every column represents a term from the corpus
• every cell represents the frequency count of a particular term in a particular document
Feature Engineering
• TF-IDF as features (advanced version of Bag-of-Words)
• TF-IDF score represents the relative importance of a term in the document and the
entire corpus (words with higher scores of weight are deemed to be more
significant)
• TF(t) = (Number of times term t appears in a document) / (Total number of terms in
the document)
• IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
• The weight of a word increases proportionally to its count in the document, but
is offset by the frequency of the word across the corpus
Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.
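Applied to the three example documents, scikit-learn's TfidfVectorizer illustrates the idea; note that its default IDF is the smoothed ln((1+N)/(1+df)) + 1 rather than the slide's log_e(N/df), and each row is L2-normalised, so the numbers differ slightly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["It is going to rain today.",
        "Today I am not going outside.",
        "I am going to watch the season premiere."]

# scikit-learn's IDF is smoothed: ln((1+N)/(1+df)) + 1
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_

# "going" appears in every document (lowest IDF); "rain" appears
# only in document 1, so it carries more weight there
print(weights[0, vocab["rain"]], weights[0, vocab["going"]])
```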
Feature Engineering
Word Embeddings (Word2Vec)
• What is Word2Vec? A Simple Explanation | Deep Learning Tutorial 41 (Tensorflow, Keras & Python) – YouTube
• Word embedding
• a form of representing words as dense vectors
• the position of a word within the vector space is learned from text, based on the
words that surround it (its context)
• can be trained on the input corpus itself
• can be generated from pre-trained word embeddings such as GloVe,
FastText, and Word2Vec using transfer learning
• or obtained from cutting-edge models like BERT / GPT-3
• can be used as feature vectors for an ML model, to measure text similarity via
cosine similarity, and for word clustering and text classification
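In practice gensim or pre-trained GloVe files would supply the vectors; the tiny hand-written embedding table below is purely hypothetical, but it shows two of the uses above — averaging word vectors into a document feature vector and measuring similarity with cosine similarity:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for GloVe/Word2Vec vectors
embeddings = {
    "great":    np.array([0.8, 0.1, 0.0]),
    "terrible": np.array([-0.7, 0.2, 0.1]),
    "movie":    np.array([0.0, 0.9, 0.3]),
}

def document_vector(text, emb, dim=3):
    """Average the embeddings of in-vocabulary words -> one dense feature vector."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = document_vector("great movie", embeddings)
v2 = document_vector("terrible movie", embeddings)
print(cosine_similarity(v1, v2))
```

The two documents share "movie" but differ on sentiment words, so the similarity lands between 0 and 1 rather than near either extreme.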
Model Building & Training
• train_model Utility Function
• used to train a model
• Accepts (as inputs)
• the classifier
• feature_vector of training data
• labels of training data
• feature vectors of the validation data
• Using these inputs, the model is trained and the accuracy score on the
validation set is computed
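A sketch of such a train_model utility; the validation labels are passed in explicitly here so the function is self-contained (the original helper's exact signature may differ):

```python
from sklearn import metrics, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, train_features, train_labels,
                valid_features, valid_labels):
    """Fit the classifier, then return accuracy on the validation features."""
    classifier.fit(train_features, train_labels)
    predictions = classifier.predict(valid_features)
    return metrics.accuracy_score(valid_labels, predictions)

# Tiny illustrative run (hypothetical data)
texts = ["good great fine", "bad awful poor", "great good", "awful bad"]
labels = [1, 0, 1, 0]
cv = CountVectorizer().fit(texts)
X = cv.transform(texts)
acc = train_model(naive_bayes.MultinomialNB(), X[:2], labels[:2], X[2:], labels[2:])
print(acc)
```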
Model Building & Training
• Naïve Bayes Classifier
• based on Bayes’ Theorem with an assumption of independence among
predictors (presence of a particular feature in a class is unrelated to the
presence of any other feature)
• Naïve Bayes on Count Vectors
• Naïve Bayes on TF-IDF Vectors
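A toy run of Naïve Bayes on both feature types, using scikit-learn's MultinomialNB and a few hypothetical reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-dataset (1 = positive, 0 = negative)
train_texts = ["loved it great product", "terrible waste broke",
               "great value loved it", "broke immediately terrible"]
train_labels = [1, 0, 1, 0]
test_texts = ["great product loved it", "terrible broke"]

results = {}
for name, vec in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    X = vec.fit_transform(train_texts)          # Naïve Bayes on Count / TF-IDF vectors
    nb = MultinomialNB().fit(X, train_labels)
    results[name] = nb.predict(vec.transform(test_texts)).tolist()
print(results)
```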
Model Building & Training
• SVM Classifier
• finds the best possible hyper-plane (a line in two dimensions) that segregates the two classes
• SVM on Count Vectors
• SVM on TF-IDF Vectors
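A corresponding SVM sketch: LinearSVC fits a maximum-margin separating hyperplane on TF-IDF features (data again hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labelled reviews (1 = positive, 0 = negative)
texts = ["excellent quality highly recommend", "awful product do not buy",
         "excellent highly recommend", "awful product avoid"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# LinearSVC learns the hyperplane separating the two classes
svm = LinearSVC().fit(X, labels)
preds = svm.predict(tfidf.transform(["excellent quality", "awful avoid"])).tolist()
print(preds)
```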
Model Building & Training
• Random Forest Classifier
• ensemble model (bagging method) - using decision tree as the individual
model
• Random Forest on Count Vectors
• Random Forest on TF-IDF Vectors
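The same pattern with a Random Forest — a bagging ensemble in which each decision tree votes on the final prediction (count vectors shown; swapping in TfidfVectorizer works identically):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical labelled reviews (1 = positive, 0 = negative)
texts = ["loved it works great", "broke fast terrible",
         "works great loved", "terrible broke refund"]
labels = [1, 0, 1, 0]

cv = CountVectorizer()
X = cv.fit_transform(texts)

# Bagging ensemble: each decision tree is trained on a bootstrap
# sample of the rows and votes on the final prediction
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, labels)
preds = rf.predict(cv.transform(["works great loved", "terrible refund"])).tolist()
print(preds)
```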
