ML7 - Text Classification
Text classification is a widely used natural language processing (NLP) task in which text documents are automatically assigned to predefined categories. The key steps are preparing the text data; engineering features from the text, such as count vectors, TF-IDF vectors, and word embeddings; training classifiers such as Naive Bayes, SVM, and Random Forest on those features; and evaluating model performance. The running example is classifying Amazon reviews as positive or negative sentiment based on their text.
Text Classification
REFERENCES
A Comprehensive Guide to Understand and Implement Text Classification in Python (analyticsvidhya.com)
Natural Language Processing | NLP in Python | NLP Libraries (analyticsvidhya.com)

Text Classification
• Widely used Natural Language Processing (NLP) task
• Goal: automatically classify text documents into one or more predefined categories (a supervised ML task)
• Examples of text classification:
  • Understanding audience sentiment from social media
  • Detecting spam vs. non-spam emails
  • Auto-tagging customer queries
  • Categorizing news articles into defined topics

Text Classification Pipeline
• Dataset preparation: load the dataset, perform basic pre-processing, and split it into training and test/validation sets
• Feature engineering: transform the raw dataset into flat features that can be used in an ML model, and create new features from the existing data
• Model training: train an ML model on the labelled dataset
Our Example: Problem Statement
• Detect sentiment in Amazon reviews
• Classify each review as ‘0’ (negative) or ‘1’ (positive)
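The deck describes the data-preparation steps without showing code. Below is a minimal sketch, assuming the ‘corpus’ file stores one review per line as "<label> <text>" (the exact file format is an assumption); the variable names (df, train_x, valid_y, ...) are illustrative, not from the deck:

```python
# Data-preparation sketch: load the corpus, build a dataframe,
# split into train/validation sets, and encode the target labels.
# Assumes one review per line in the form "<label> <text>".
import pandas as pd
from sklearn import model_selection, preprocessing

labels, texts = [], []
with open("corpus", encoding="utf-8") as fh:
    for line in fh:
        label, _, text = line.strip().partition(" ")
        labels.append(label)
        texts.append(text)

df = pd.DataFrame({"text": texts, "label": labels})

# Split into training and validation sets.
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42
)

# Encode the target variable (e.g. 'Label_1'/'Label_2' -> 0/1).
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)
```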
Data Preparation
• Dataset: 10,000 Amazon reviews, stored in the file ‘corpus’
• Each record has a text field (the variable-length review) and a label (Label_1: negative review; Label_2: positive review)
• Load the dataset and create a dataframe from the texts and labels
• Split the dataset into training and validation sets, and encode the target variable (see the sketch above)

Feature Engineering
• Raw text data is transformed into feature vectors, and new features are created from the existing dataset
• Different ways of extracting vectors from text (code sketches for each appear at the end of this section):
  • Count Vectors as features
  • TF-IDF Vectors as features
  • Word Embeddings as features

Feature Engineering: Count Vectors (Bag-of-Words)
• Matrix representation of the dataset in which:
  • every row represents a document from the corpus
  • every column represents a term from the corpus
  • every cell holds the frequency count of that term in that document

Feature Engineering: TF-IDF (a refinement of Bag-of-Words)
• The TF-IDF score represents the relative importance of a term in a document and in the entire corpus; terms with higher weights are deemed more significant
• TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
• IDF(t) = log_e(total number of documents / number of documents containing term t)
• A term's weight grows with its count in a document but shrinks with its frequency across the corpus
• Example corpus:
  • Document 1: "It is going to rain today."
  • Document 2: "Today I am not going outside."
  • Document 3: "I am going to watch the season premiere."

Feature Engineering: Word Embeddings (Word2Vec)
• Reference: "What is Word2Vec? A Simple Explanation | Deep Learning Tutorial 41 (Tensorflow, Keras & Python)" – YouTube
• A word embedding represents each word as a dense vector
• The position of a word within the vector space is learned from text, based on the words that surround it (its context)
• Embeddings can be trained on the input corpus itself
• They can also come from pre-trained embeddings such as GloVe, FastText, and Word2Vec via transfer learning, or from more recent methods such as BERT / GPT-3
• Embeddings can serve as feature vectors for an ML model, and also support text-similarity measurement via cosine similarity, word clustering, and text classification

Model Building & Training
• train_model utility function (see the sketch at the end of this section)
  • Used to train a model; accepts as inputs:
    • the classifier
    • the feature vectors of the training data
    • the labels of the training data
    • the feature vectors of the validation data
  • Using these inputs, the model is trained and the accuracy score is computed

Model Building & Training: Naive Bayes Classifier
• Based on Bayes' Theorem, with an assumption of independence among predictors (the presence of a particular feature in a class is unrelated to the presence of any other feature)
• Naive Bayes on Count Vectors
• Naive Bayes on TF-IDF Vectors

Model Building & Training: SVM Classifier
• Finds the best possible hyper-plane / line that separates the two classes
• SVM on Count Vectors
• SVM on TF-IDF Vectors

Model Building & Training: Random Forest Classifier
• Ensemble model (bagging) that uses a decision tree as the individual learner
• Random Forest on Count Vectors
• Random Forest on TF-IDF Vectors
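The slides name the feature types without code. A minimal Count Vectors sketch using scikit-learn's CountVectorizer, reusing df, train_x, and valid_x from the data-preparation sketch (those names are illustrative):

```python
# Count Vectors (Bag-of-Words): rows = documents, columns = terms,
# cells = frequency count of a term in a document.
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
count_vect.fit(df["text"])  # learn the vocabulary from the corpus

xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
```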
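The example corpus above makes the TF-IDF formulas concrete: ‘going’ appears in all three documents, so IDF(going) = log_e(3/3) = 0 and it carries no discriminative weight, while ‘rain’ appears only in Document 1, so IDF(rain) = log_e(3/1) ≈ 1.10 and, with TF(rain) = 1/6, TF-IDF(rain, Document 1) ≈ 0.17 × 1.10 ≈ 0.18. A corresponding scikit-learn sketch (note that TfidfVectorizer uses a smoothed IDF plus L2 normalisation, so its scores differ from the raw formula):

```python
# TF-IDF features sketch, reusing df, train_x, valid_x from the
# data-preparation sketch. scikit-learn computes a smoothed IDF,
# log((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, so the
# weights differ from the slide's raw log_e(N / df(t)) formula.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer="word", max_features=5000)
tfidf_vect.fit(df["text"])  # learn vocabulary and IDF weights

xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
```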
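A word-embedding feature sketch, training Word2Vec on the corpus itself with gensim (the 4.x API is assumed) and representing each document as the average of its word vectors; averaging is one common choice, not something the deck prescribes, and pre-trained GloVe or FastText vectors could be substituted:

```python
# Word-embedding features: train Word2Vec on the corpus, then average
# the word vectors of each document (zeros if no token is in-vocabulary).
import numpy as np
from gensim.models import Word2Vec

tokenized = [text.lower().split() for text in df["text"]]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2)

def doc_vector(text, model):
    tokens = [t for t in text.lower().split() if t in model.wv]
    if not tokens:
        return np.zeros(model.wv.vector_size)
    return np.mean([model.wv[t] for t in tokens], axis=0)

xtrain_w2v = np.vstack([doc_vector(t, w2v) for t in train_x])
xvalid_w2v = np.vstack([doc_vector(t, w2v) for t in valid_x])
```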
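Finally, a sketch of the train_model utility as the slide describes it (fit the classifier on the training features, predict on the validation features, compute accuracy), together with the Naive Bayes / SVM / Random Forest runs on Count and TF-IDF vectors. MultinomialNB and LinearSVC are assumed variants; the deck only names the classifier families:

```python
# train_model utility: fit the given classifier on the training features,
# predict on the validation features, and return the accuracy score.
# valid_y comes from the data-preparation sketch above.
from sklearn import ensemble, metrics, naive_bayes, svm

def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    classifier.fit(feature_vector_train, label)
    predictions = classifier.predict(feature_vector_valid)
    return metrics.accuracy_score(valid_y, predictions)

# Naive Bayes (independence assumption among predictors)
print("NB, Count:", train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count))
print("NB, TF-IDF:", train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf))

# SVM (finds a separating hyper-plane)
print("SVM, Count:", train_model(svm.LinearSVC(), xtrain_count, train_y, xvalid_count))
print("SVM, TF-IDF:", train_model(svm.LinearSVC(), xtrain_tfidf, train_y, xvalid_tfidf))

# Random Forest (bagging ensemble of decision trees)
print("RF, Count:", train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count))
print("RF, TF-IDF:", train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf))

# The averaged Word2Vec features (xtrain_w2v / xvalid_w2v) can be passed
# to the same helper with any classifier that accepts dense input.
```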