ML7 - Text Classification

Text classification is a widely used natural language processing task to automatically categorize text documents into predefined categories. The key steps involve preparing text data, engineering features from the text like counts, TF-IDF, word embeddings, training classifiers like Naive Bayes, SVM, Random Forest on the features, and evaluating the model performance. An example is classifying Amazon reviews as positive or negative sentiment based on the text.


Text Classification

REFERENCES

A Comprehensive Guide to Understand and Implement Text Classification in Python (analyticsvidhya.com)

Natural Language Processing | NLP in Python | NLP Libraries (analyticsvidhya.com)
Text Classification
• Widely used Natural Language Processing (NLP) task
• Goal: Automatically classify text documents into one or more predefined
categories (a supervised ML task)
• Examples of Text Classification
• Understanding audience sentiment from social media
• Detection of spam and non-spam emails
• Auto tagging of customer queries
• Categorization of news articles into defined topics
Text Classification Pipeline
• Dataset Preparation
• Loading dataset and performing basic pre-processing
• Dataset is then split into training and test/validation sets
• Feature Engineering
• Raw dataset transformed into flat features which can be used in an ML model
• Plus the process of creating new features from the existing data
• Model Training
• A ML model is trained on a labelled dataset

• Our Example Problem Statement


• To detect sentiment in Amazon Reviews
• Classify Reviews (‘0’: negative or ‘1’: positive)
Data Preparation
• Dataset
• Amazon Reviews (10,000 in number)
• Filename: ‘corpus’
• Text (variable length review) & Label (Label_1: Negative review; Label_2:
Positive review)
• Load dataset & create dataframe using texts & labels
• Split the dataset into training & validation sets & encode the target
variable
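The preparation steps above can be sketched with pandas and scikit-learn. The inline reviews and the "Label_x <text>" line format are stand-ins for the actual 'corpus' file, whose exact layout is assumed here:

```python
import pandas as pd
from sklearn import model_selection, preprocessing

# Assumed line format "Label_x <review text>"; a handful of inline
# reviews stand in for the 10,000-review 'corpus' file
lines = ["Label_1 This product broke after two days.",
         "Label_2 Absolutely loved it, works perfectly.",
         "Label_1 Waste of money, very disappointed.",
         "Label_2 Great quality and fast shipping."]
labels = [line.split(" ", 1)[0] for line in lines]
texts = [line.split(" ", 1)[1] for line in lines]

# Create a dataframe using texts & labels
df = pd.DataFrame({"text": texts, "label": labels})

# Split the dataset into training & validation sets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42)

# Encode the target variable (Label_1 -> 0: negative, Label_2 -> 1: positive)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)
print(len(train_x), len(valid_x))
```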
Feature Engineering
• Raw text data transformed into feature vectors and new features
created using the existing dataset
• Different Methods / Ways of extracting vectors from text
• Count Vectors as features
• TF-IDF Vectors as features
• Word Embeddings as features
Feature Engineering
• Count Vectors as features (Bag-of-Words)
• Matrix notation of the dataset in which
• every row represents a document from the corpus
• every column represents a term from the corpus
• every cell represents the frequency count of a particular term in a particular document
Feature Engineering
• TF-IDF as features (advanced version of Bag-of-Words)
• TF-IDF score represents the relative importance of a term in the document and the
entire corpus (words with higher scores of weight are deemed to be more
significant)
• TF(t) = (Number of times term t appears in a document) / (Total number of terms in
the document)
• IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
• The weight of a word increases proportionally to its count in the document, but
is offset by the frequency of the word across the corpus
Document 1: It is going to rain today.
Document 2: Today I am not going outside.
Document 3: I am going to watch the season premiere.
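Applied to the three example documents, scikit-learn's TfidfVectorizer illustrates the idea; note that its default IDF is the smoothed ln((1+N)/(1+df)) + 1 rather than the slide's log_e(N/df), and each row is L2-normalised, so the numbers differ slightly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["It is going to rain today.",
        "Today I am not going outside.",
        "I am going to watch the season premiere."]

# scikit-learn's IDF is smoothed: ln((1+N)/(1+df)) + 1
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_

# "going" appears in every document (lowest IDF); "rain" appears
# only in document 1, so it carries more weight there
print(weights[0, vocab["rain"]], weights[0, vocab["going"]])
```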
Feature Engineering
Word Embeddings (Word2Vec)
• What is Word2Vec? A Simple Explanation | Deep Learning Tutorial 41 (Tensorflow, Keras & Python) – YouTube
• Word embedding
• a form of representing words as dense vectors
• the position of a word within the vector space is learned from text, based on the
words that surround it (its context)
• can be trained on the input corpus itself
• can be generated from pre-trained word embeddings such as GloVe,
FastText, and Word2Vec using transfer learning
• or obtained from cutting-edge models like BERT / GPT-3
• can be used as feature vectors for an ML model, to measure text similarity via
cosine similarity, and for word clustering and text classification
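In practice gensim or pre-trained GloVe files would supply the vectors; the tiny hand-written embedding table below is purely hypothetical, but it shows two of the uses above — averaging word vectors into a document feature vector and measuring similarity with cosine similarity:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for GloVe/Word2Vec vectors
embeddings = {
    "great":    np.array([0.8, 0.1, 0.0]),
    "terrible": np.array([-0.7, 0.2, 0.1]),
    "movie":    np.array([0.0, 0.9, 0.3]),
}

def document_vector(text, emb, dim=3):
    """Average the embeddings of in-vocabulary words -> one dense feature vector."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = document_vector("great movie", embeddings)
v2 = document_vector("terrible movie", embeddings)
print(cosine_similarity(v1, v2))
```

The two documents share "movie" but differ on sentiment words, so the similarity lands between 0 and 1 rather than near either extreme.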
Model Building & Training
• train_model Utility Function
• used to train a model
• Accepts (as inputs)
• the classifier
• feature_vector of training data
• labels of training data
• feature vectors of the validation data
• Using these inputs, the model is trained and the accuracy score on the
validation set is computed
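A sketch of such a train_model utility; the validation labels are passed in explicitly here so the function is self-contained (the original helper's exact signature may differ):

```python
from sklearn import metrics, naive_bayes
from sklearn.feature_extraction.text import CountVectorizer

def train_model(classifier, train_features, train_labels,
                valid_features, valid_labels):
    """Fit the classifier, then return accuracy on the validation features."""
    classifier.fit(train_features, train_labels)
    predictions = classifier.predict(valid_features)
    return metrics.accuracy_score(valid_labels, predictions)

# Tiny illustrative run (hypothetical data)
texts = ["good great fine", "bad awful poor", "great good", "awful bad"]
labels = [1, 0, 1, 0]
cv = CountVectorizer().fit(texts)
X = cv.transform(texts)
acc = train_model(naive_bayes.MultinomialNB(), X[:2], labels[:2], X[2:], labels[2:])
print(acc)
```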
Model Building & Training
• Naïve Bayes Classifier
• based on Bayes’ Theorem with an assumption of independence among
predictors (presence of a particular feature in a class is unrelated to the
presence of any other feature)
• Naïve Bayes on Count Vectors
• Naïve Bayes on TF-IDF Vectors
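A toy run of Naïve Bayes on both feature types, using scikit-learn's MultinomialNB and a few hypothetical reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-dataset (1 = positive, 0 = negative)
train_texts = ["loved it great product", "terrible waste broke",
               "great value loved it", "broke immediately terrible"]
train_labels = [1, 0, 1, 0]
test_texts = ["great product loved it", "terrible broke"]

results = {}
for name, vec in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    X = vec.fit_transform(train_texts)          # Naïve Bayes on Count / TF-IDF vectors
    nb = MultinomialNB().fit(X, train_labels)
    results[name] = nb.predict(vec.transform(test_texts)).tolist()
print(results)
```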
Model Building & Training
• SVM Classifier
• finds the best possible hyper-plane (a line in two dimensions) that segregates the two classes
• SVM on Count Vectors
• SVM on TF-IDF Vectors
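A corresponding SVM sketch: LinearSVC fits a maximum-margin separating hyperplane on TF-IDF features (data again hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labelled reviews (1 = positive, 0 = negative)
texts = ["excellent quality highly recommend", "awful product do not buy",
         "excellent highly recommend", "awful product avoid"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

# LinearSVC learns the hyperplane separating the two classes
svm = LinearSVC().fit(X, labels)
preds = svm.predict(tfidf.transform(["excellent quality", "awful avoid"])).tolist()
print(preds)
```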
Model Building & Training
• Random Forest Classifier
• ensemble model (bagging method) - using decision tree as the individual
model
• Random Forest on Count Vectors
• Random Forest on TF-IDF Vectors
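The same pattern with a Random Forest — a bagging ensemble in which each decision tree votes on the final prediction (count vectors shown; swapping in TfidfVectorizer works identically):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical labelled reviews (1 = positive, 0 = negative)
texts = ["loved it works great", "broke fast terrible",
         "works great loved", "terrible broke refund"]
labels = [1, 0, 1, 0]

cv = CountVectorizer()
X = cv.fit_transform(texts)

# Bagging ensemble: each decision tree is trained on a bootstrap
# sample of the rows and votes on the final prediction
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, labels)
preds = rf.predict(cv.transform(["works great loved", "terrible refund"])).tolist()
print(preds)
```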
