
NLP-Unit-3 (TB)

Unit-3
Applications of Text classification models:

Text classification models have numerous real-world applications across various domains. Here are
some examples:

· Spam Detection: Text classification models are widely used to classify emails, messages, or
comments as spam or non-spam. By analyzing the content and features of incoming messages, these
models can help filter out unwanted or malicious communications.

· Sentiment Analysis: Businesses use sentiment analysis to analyze customer feedback, social
media posts, and product reviews to gauge public opinion and sentiment towards their products,
services, or brands. This information is valuable for reputation management, customer service
improvement, and market research.

· News categorization: News articles can be automatically categorized into different topics or
genres using text classification models. This helps news organizations and aggregators organize
their content and provide users with personalized news recommendations.

· Customer Support Ticket Routing: Text classification models can classify customer support
tickets based on their content and urgency, allowing organizations to prioritize and route tickets to
the appropriate departments or agents for timely resolution.

· Legal Document Classification: Law firms and legal departments can use text classification
models to categorize and organize legal documents, such as contracts, case files, and court rulings.
This improves document management efficiency and facilitates information retrieval.

· Medical Document Classification: Text classification models are employed in healthcare to


classify medical records, patient notes, and research articles into categories such as diagnosis,
treatment, and prognosis. This assists healthcare professionals in decision-making and patient care.

· Topic Modeling and Content Recommendation: Text classification models can be used to
automatically identify topics or themes in large collections of documents, enabling content
recommendation systems to suggest relevant articles, videos, or products to users based on their
interests and preferences.

· Fake News Detection: With the proliferation of misinformation and fake news, text
classification models are utilized to distinguish between credible and unreliable news sources or
articles. This helps combat the spread of false information and promotes media literacy.

· Financial Document Analysis: Text classification models are employed in finance to classify
financial reports, market news, and economic indicators, helping investors and analysts make
informed decisions and identify trends in financial markets.

· Legal Compliance and Regulatory Monitoring: Text classification models can assist
organizations in monitoring regulatory compliance by automatically classifying legal documents,
policies, and contracts to ensure adherence to relevant laws and regulations.

Pipeline for Building Text Classification System



1. Collect or create a labeled dataset suitable for the task.


2. Split the dataset into two parts (training and test) or three parts (training, validation, and test sets).
3. Transform raw text into feature vectors.
4. Train a classifier using the feature vectors and the corresponding labels from the training set.
5. Use the evaluation metric(s) to assess model performance on the test set.
6. Deploy the model and monitor its performance.
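A minimal sketch of this pipeline using scikit-learn; the tiny dataset, the TF-IDF features, and the Naive Bayes classifier below are illustrative placeholder choices, not part of the original text.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Step 1: a labeled dataset (placeholder examples)
texts = ["win a free prize now", "meeting agenda for monday", "cheap loans available today",
         "project status update", "claim your lottery reward", "lunch at noon tomorrow"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

# Step 2: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

# Step 3: transform raw text into feature vectors (TF-IDF here)
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 4: train a classifier on the training feature vectors and labels
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# Step 5: assess performance on the test set
print(classification_report(y_test, clf.predict(X_test_vec)))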

Limitations of heuristic approach in text classification:

The heuristic approach in text classification involves the use of handcrafted rules, patterns, or features
based on domain knowledge or intuition to classify text documents into predefined categories. While
heuristic approaches have been widely used in text classification tasks, they come with several
limitations:

· Subjectivity and Bias: Heuristic rules are often created based on the subjective
understanding and intuition of domain experts or developers. This can lead to the introduction
of biases and inconsistencies in the classification process, as different individuals may have
different interpretations of the text data and may create rules differently.

· Limited Generalization: Heuristic rules are typically designed to address specific


patterns or characteristics observed in the training data. As a result, they may lack
generalization to unseen or varied text data, leading to poor performance when applied to new
datasets or domains. Heuristic approaches often struggle to adapt to evolving language usage,
slang, or new topics that were not accounted for in the rule design.

· Scalability and Maintenance: Heuristic approaches can become cumbersome to scale


and maintain, especially as the complexity of the classification task increases or the volume
of text data grows. Manually crafting and updating rules to handle diverse text inputs can be
time-consuming and labor-intensive, making it challenging to maintain the effectiveness of
the classification system over time.


· Difficulty in Handling Complex Relationships: Text data often exhibits complex


relationships and nuances that may be challenging to capture with simple heuristic rules.
Heuristic approaches may struggle to model intricate dependencies between words, phrases,
or concepts in the text, leading to suboptimal performance, especially in tasks requiring high
precision and recall.

· Robustness to Noise and Variability: Heuristic rules may be sensitive to noise,


variability, or linguistic ambiguity present in the text data. Minor variations in language use,
spelling errors, grammatical inconsistencies, or textual noise can disrupt the effectiveness of
heuristic-based classifiers, leading to errors in classification results.

· Limited Adaptability and Flexibility: Heuristic rules are often static and inflexible,
lacking the ability to adapt dynamically to changing data distributions, user preferences, or
task requirements. As a result, heuristic-based classifiers may struggle to maintain
performance in dynamic or evolving environments, where continuous adaptation and learning
are necessary.

· Lack of Explainability: Heuristic rules may lack transparency and interpretability,


making it difficult to understand the reasoning behind classification decisions. Unlike
machine learning models that can provide insights into feature importance or decision-
making processes, heuristic-based classifiers often operate as black-box systems, limiting the
ability to diagnose errors or refine the classification criteria.

One Pipeline, Many Classifiers

The "one pipeline, many classifiers" approach in machine learning refers to a methodology where a
single data preprocessing and feature extraction pipeline is applied consistently across multiple
classification algorithms. The purpose is to compare the performance of different classifiers on the same
dataset or to choose the best model for deployment.

Working

1. Unified Preprocessing Pipeline:


○ All input data undergoes the same preprocessing steps, such as:
■ Handling missing values
■ Encoding categorical variables (e.g., one-hot encoding)
■ Feature scaling or normalization
■ Dimensionality reduction (optional)
2. Feature Extraction (if applicable):
○ Transform the preprocessed data into a format suitable for machine learning, such as:
■ Bag-of-Words or TF-IDF for text data
■ Principal Component Analysis (PCA) for dimensionality reduction
3. Multiple Classifiers:
○ The preprocessed and feature-engineered data is fed into multiple classification
algorithms, such as:
■ Logistic Regression
■ Decision Trees

■ Random Forests
■ Support Vector Machines (SVM)
■ Neural Networks
■ Gradient Boosting (e.g., XGBoost, LightGBM)
4. Evaluation:
○ Each classifier's performance is assessed using a consistent evaluation metric (e.g.,
accuracy, F1-score, ROC-AUC).
○ Cross-validation is often employed to ensure robust evaluation.
5. Comparison and Selection:
○ The classifier with the best performance for the specific task is selected for further use or
deployment.

We can build text classifiers by altering Step 3 in the pipeline and keeping the remaining steps constant.
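A hedged scikit-learn sketch of the "one pipeline, many classifiers" idea: the same TF-IDF feature extraction is reused while only the classifier is swapped. The dataset, classifier settings, and cross-validation setup are placeholder assumptions.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

texts = ["good product", "terrible service", "loved it", "worst purchase ever",
         "excellent quality", "not worth the money"]
labels = [1, 0, 1, 0, 1, 0]

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

for name, clf in classifiers.items():
    # Same preprocessing and feature-extraction step; only the classifier changes
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    print(name, scores.mean())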

1. Naive Bayes Classifier


● Naive Bayes is a probabilistic classifier that uses Bayes’ theorem to classify texts based on the
evidence seen in training data.
● It estimates the conditional probability of each feature for each class from how often that feature occurs in that class, and multiplies the probabilities of all the features of a given text to compute a final score for each class.
● Finally, it chooses the class with maximum probability.
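In symbols (a standard statement of this idea, not taken from the original text): for a document d with features w1, w2, ..., wn, the classifier scores each class c as

P(c | d) ∝ P(c) · P(w1 | c) · P(w2 | c) · ... · P(wn | c)

and picks the class with the highest score; in practice, the sum of log probabilities is used instead of the product to avoid numerical underflow.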

2. Logistic Regression
● Logistic regression “learns” the weights for individual features based on how important they are
to make a classification decision.
● The goal of logistic regression is to learn a linear separator between classes in the training data
with the aim of maximizing the probability of the data.
● This “learning” of the feature weights and of the probability distribution over classes is done through the “logistic” (sigmoid) function, hence the name logistic regression.
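As a reminder of the underlying function (standard form, not specific to this text), the probability of the positive class is computed as

P(y = 1 | x) = 1 / (1 + e^-(w·x + b))

where w are the learned feature weights for the features x and b is a bias term; the weights are chosen so as to maximize the likelihood of the training data.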

3. Support Vector Machine


● It aims to look for an optimal hyperplane in a higher dimensional space, which can separate the
classes in the data by a maximum possible margin.
● Further, SVMs are capable of learning even non-linear separations between classes, unlike
logistic regression.
● However, they may also take longer to train.

Using Neural Embeddings in Text Classification

Word Embeddings


● Neural network–based architectures have become popular for “learning” word representations,
which are known as “word embeddings.”
● There are several pre-trained Word2vec models trained on large corpora available on the internet.

● In the Doc2vec embedding scheme, we learn a direct representation for the entire document
(sentence/paragraph) rather than each word. Just as we used word and character embeddings as
features for performing text classification, we can also use Doc2vec as a feature representation
mechanism. There are no existing pretrained models that work with the latest version of Doc2vec.
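An illustrative sketch of using Doc2vec document vectors as features for a classifier, here with the gensim library (gensim 4.x API assumed); the toy corpus, vector size, and classifier choice are placeholder assumptions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

train_texts = [["great", "phone", "battery"], ["poor", "screen", "quality"],
               ["excellent", "camera"], ["bad", "battery", "life"]]
train_labels = [1, 0, 1, 0]

# Train a small Doc2vec model on the (toy) tokenized corpus
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(train_texts)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Use the learned document vectors as the feature representation
X_train = [d2v.dv[i] for i in range(len(train_texts))]
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# For a new document, infer its vector and classify it
new_vec = d2v.infer_vector(["battery", "quality"])
print(clf.predict([new_vec]))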

Deep Learning for Text Classification

Two of the most commonly used neural network architectures for text classification are convolutional
neural networks (CNNs) and recurrent neural networks (RNNs). Long short-term memory (LSTM)
networks are a popular form of RNNs.

1. Tokenize the texts and convert them into word index vectors.

2. Pad the text sequences so that all text vectors are of the same length.

3. Map every word index to an embedding vector. This is equivalent to multiplying one-hot word vectors with the embedding matrix. The embedding matrix can either be populated using pre-trained embeddings or trained on this corpus.

4. Use the output from Step 3 as the input to a neural network architecture.
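A sketch of Steps 1 to 3 with the Keras preprocessing utilities; the vocabulary size, sequence length, and embedding dimension are arbitrary placeholder choices, and the exact import path may vary with the TensorFlow/Keras version.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

texts = ["the movie was great", "the movie was terrible"]

# Step 1: tokenize and convert texts into word index vectors
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Step 2: pad so that all text vectors are of the same length
data = pad_sequences(sequences, maxlen=20)

# Step 3: map every word index to an embedding vector.
# The embedding matrix could instead be filled with pre-trained vectors
# (e.g., GloVe) and passed in via the `weights` argument of Embedding.
embedding_layer = Embedding(input_dim=10000, output_dim=100)

# Step 4: this embedding layer becomes the first layer of the neural network.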

CNNs for Text Classification

● CNNs typically consist of a series of convolution and pooling layers as the hidden layers.

● In the context of text classification, CNNs can be thought of as learning the most useful bag-of-
words/n-grams features instead of taking the entire collection of words/n-grams as features.
● E.g., since our dataset has only two classes (positive and negative), the output layer has two outputs, with the softmax activation function.
● The CNN is defined with convolution and pooling layers, for example using the Sequential model class in Keras, which allows us to specify DL models as a sequential stack of layers, one after another.
● Once the layers and their activation functions are specified, the next task is to define other important parameters, such as the optimizer and the loss function, and to tune the hyperparameters of the model.
● Once all this is done, the next step is to train and evaluate the model.
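A minimal Keras sketch of such a CNN for a two-class problem; the layer sizes, filter counts, and training settings are illustrative assumptions rather than tuned values, and the training data names are placeholders.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=100),              # word index -> embedding vector
    Conv1D(filters=128, kernel_size=5, activation="relu"),   # convolution layer
    GlobalMaxPooling1D(),                                     # pooling layer
    Dense(64, activation="relu"),
    Dense(2, activation="softmax"),                           # two outputs: positive / negative
])

# Optimizer and loss function, then training and evaluation
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(padded_train_sequences, train_labels, validation_split=0.1, epochs=5)
# model.evaluate(padded_test_sequences, test_labels)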

The major steps in CNN are as follows:


● Padding
● Convolution
● Pooling


● Flattening

Padding in CNN

The kernel scans the border elements fewer times than the middle elements. This is where padding comes into the picture: we add extra elements (pixels for images, or zeros/tokens for text) around the input.

● Padding adds extra space or characters to the beginning or end of a string to ensure it meets
a specific length requirement.
● Left Padding: Adds space or characters to the left side of the string, pushing the original
content to the right.
● Right Padding: Adds space or characters to the right side of the string, shifting the original
content to the left.
● Purpose: Padding is commonly used in formatting data, such as aligning text in tables or
ensuring consistent display widths.
● Padding is especially important when dealing with variable-length text sequences.
● While processing text with a CNN using sentence-level embeddings, each sentence is typically converted into a numerical representation (e.g., word embeddings) and then padded with zeros so that all sequences have the same length (since sentence lengths differ).
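For example, with the Keras pad_sequences utility (a small illustrative sketch with made-up index values), "pre" padding adds zeros on the left and "post" padding adds them on the right:

from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[5, 8, 2], [7, 1]]               # word index vectors of different lengths

print(pad_sequences(sequences, maxlen=5, padding="pre"))
# [[0 0 5 8 2]
#  [0 0 0 7 1]]

print(pad_sequences(sequences, maxlen=5, padding="post"))
# [[5 8 2 0 0]
#  [7 1 0 0 0]]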

A Convolutional Neural Network is made up of two main layers:

● A convolution layer for obtaining features from the data

● A pooling layer for reducing the size of the feature map

⮚ Convolution
● Convolution is the process through which features are obtained. These features are then
fed to the CNN.
● The convolution operation is responsible for detecting the most important features.


● The output of the convolution operation is known as a feature map, a convolved feature,
or an activation map.
● The feature map is computed through the application of a feature detector to the input
data.
● In some deep learning frameworks, the feature detector is also referred to as a kernel or a
filter. For instance, this filter can be a 3 by 3 matrix.
● The feature map is computed through an element-wise multiplication of the kernel with each region of the input matrix, followed by summing the products at each position.
● This ensures that the feature map passed on to the next layers is smaller than the input but contains all the important features.
● The filter does this by sliding step by step through every element in the input data.


Example of Convolution Operation
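A small numerical sketch (with illustrative values) of sliding a 2 by 2 filter over a 3 by 3 input, multiplying element-wise and summing at each position:

import numpy as np

x = np.array([[1, 2, 0],
              [0, 1, 3],
              [2, 1, 1]])          # input matrix (e.g., embedded text or an image patch)
k = np.array([[1, 0],
              [0, 1]])             # 2x2 filter / kernel / feature detector

feature_map = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # element-wise multiply the kernel with the current region and sum the products
        feature_map[i, j] = np.sum(x[i:i+2, j:j+2] * k)

print(feature_map)
# [[2. 5.]
#  [1. 2.]]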


⮚ Pooling

● Pooling works by placing a matrix, say a 2 by 2 matrix, on the feature map and
performing a certain operation.
● For instance, in max-pooling, the maximum number falling within that matrix is picked.
In average pooling, the average of the numbers falling within that matrix is computed.
In min pooling, the minimum number falling in that matrix is used.
● The result is known as a pooled feature map. Pooling ensures that the size of the data
passed to the CNN is reduced further.
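A numerical sketch of 2 by 2 max, average, and min pooling over a 4 by 4 feature map (the values are illustrative):

import numpy as np

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 5, 7],
                        [1, 1, 3, 4]])

# Split the 4x4 map into non-overlapping 2x2 blocks (stride 2)
blocks = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)

print(blocks.max(axis=(2, 3)))    # max pooling     -> [[6 2] [2 7]]
print(blocks.mean(axis=(2, 3)))   # average pooling -> [[3.5  1.25] [1.   4.75]]
print(blocks.min(axis=(2, 3)))    # min pooling     -> [[1 0] [0 3]]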

⮚ Flattening
Flattening involves converting the pooled feature map into a single column that will be passed to
the fully connected layer.

● The flattened feature map becomes the input of the neural network. This is then passed to a fully
connected layer.
● Based on the application (binary classification / multi-class classification), an appropriate activation function is applied at the output layer.


Recurrent Neural Network


In each cell, the input of the current time step x (present value), the hidden state h of the previous time step (past value), and a bias are combined and then passed through an activation function to determine the hidden state of the current time step.

Input: x(t) is taken as the input to the network at time step t. For example, x(1) could be a one-hot vector corresponding to a word of a sentence.

Hidden state: h(t) represents a hidden state at time t and acts as the “memory” of the network. h(t) is calculated based on the current input and the previous time step’s hidden state:


h(t) = f(Wx·x(t) + Wh·h(t-1) + b)

where b is a bias term and f is a non-linear transformation such as tanh.

Weights: The RNN has input-to-hidden connections parameterized by the weight matrix Wx and hidden-to-hidden recurrent connections parameterized by the weight matrix Wh.

Steps in Training through RNN

1. A single-time step of the input is provided to the network.

2. Then calculate its current state using a set of current input and the previous state.

3. The current ht becomes ht-1 for the next time step.

4. Once all the time steps are completed the final current state is used to calculate the
output.

5. One or more dense layers are then added after the recurrent layers to convert the
learned features into the appropriate output format.

6. When predicting the next word, the dense layers translate the hidden representations to
a probability distribution across the vocabulary, indicating the likelihood that each word
will be the following word.

7. This output is then compared to the actual output, i.e., the target output, and the error is generated.

8. The error is then back-propagated to the network to update the weights and hence the
network (RNN) is trained using Backpropagation through time.
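A plain-numpy sketch of the forward pass described in Steps 1 to 6 for a many-to-one setup; all shapes, weights, and inputs below are random placeholders, and the error computation and BPTT of Steps 7 and 8 are left to a framework in practice.

import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, num_classes = 5, 8, 16, 3

# Random stand-ins for a sequence of word vectors and the learned weights
x = rng.normal(size=(T, input_dim))
Wx = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
Wh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
b = np.zeros(hidden_dim)
Wy = rng.normal(size=(num_classes, hidden_dim))  # dense layer on top
h = np.zeros(hidden_dim)                         # initial hidden state

# Steps 1-3: feed one time step at a time; h(t) becomes h(t-1) for the next step
for t in range(T):
    h = np.tanh(Wx @ x[t] + Wh @ h + b)

# Steps 4-6: the final state goes through a dense layer and a softmax
logits = Wy @ h
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)   # probability distribution over the output classes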

Types of RNNs
1. One to One: This is also called a Vanilla Neural Network. It is used in machine learning problems that have a single input and a single output. Examples are Image Classification and Character Generation.

2. One to Many: It has a single input and multiple outputs. An example is Music Generation.

3. Many to One: RNN takes a sequence of inputs and produces a single output. The examples are
Sentiment classification, Text Classification, prediction of the next word.

4. Many to Many: RNN takes a sequence of inputs and produces a sequence of outputs. For example,
Language Translation.

RNN as Language Translator/ Time Series (Additional/ Extra Material)



RNN as a Sequence Classifier/ next word prediction (Additional/ Extra Material)


The RNN architecture laid the foundation for ML models to have language processing capabilities. Several
variants have emerged that share its memory retention principle and improve on its original functionality. The
following are some examples.

Bidirectional recurrent neural networks


A bidirectional recurrent neural network (BRNN) processes data sequences with forward and backward layers
of hidden nodes. The forward layer works similarly to the RNN, which stores the previous input in the hidden
state and uses it to predict the subsequent output. Meanwhile, the backward layer works in the opposite
direction by taking both the current input and the future hidden state to update the present hidden state.
Combining both layers enables the BRNN to improve prediction accuracy by considering past and future
contexts. For example, you can use the BRNN to predict the word trees in the sentence Apple trees are tall.


Long short-term memory


Long short-term memory (LSTM) is an RNN variant that enables the model to expand its memory capacity to
accommodate a longer timeline. An RNN can only remember the immediate past input. It can’t use inputs from
several previous sequences to improve its prediction.

Consider the following sentences: Tom is a cat. Tom’s favorite food is fish. When you’re using an RNN, the
model can’t remember that Tom is a cat. It might generate various foods when it predicts the last word. LSTM networks add a special memory block, called a cell, in the hidden layer. Each cell is controlled by an input gate, an output gate, and a forget gate, which enable the layer to remember helpful information. For example, the cell remembers the words Tom and cat, enabling the model to predict the word fish.

Gated recurrent units


A gated recurrent unit (GRU) is an RNN that enables selective memory retention. The model adds an update gate and a reset gate to its hidden layer, which can store or remove information in the memory.
Backpropagation Through Time (BPTT)

● The gradient computation involves performing a forward propagation pass moving left to right through the unrolled graph, followed by a backward propagation pass moving right to left through the graph.
● The runtime is O(τ ) and cannot be reduced by parallelization because the forward propagation
graph is inherently sequential; each time step may only be computed after the previous one.
● States computed in the forward pass must be stored until they are reused during the backward
pass, so the memory cost is also O(τ).
● The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-
propagation through time or BPTT.
● The network with recurrence between hidden units is thus very powerful but also expensive to
train.

Limitations of recurrent neural networks

● An RNN processes data sequentially, which limits its ability to process a large number of texts
efficiently.
● For example, an RNN model can analyze a buyer’s sentiment from a couple of sentences.
● However, it requires massive computing power, memory space, and time to summarize a page of an
essay.

Transformers overcome the limitations of recurrent neural networks.

● Transformers are deep learning models that use self-attention mechanisms in an encoder-
decoder feed-forward neural network. They can process sequential data the same way
that RNNs do.
● Transformers don’t use hidden states to capture the interdependencies of data sequences.
Instead, they use a self-attention head to process data sequences in parallel. This enables
transformers to train and process longer sequences in less time than an RNN does. With
the self-attention mechanism, transformers overcome the memory limitations and
sequence interdependencies that RNNs face.
● Transformers can process data sequences in parallel and use positional encoding to
remember how each input relates to others.


● By processing all input sequences simultaneously, a transformer isn’t subjected to


backpropagation restrictions because gradients can flow freely to all weights. Parallelism
enables transformers to scale massively and handle complex NLP tasks by building larger
models.

LSTMs for Text Classification

● LSTMs and other variants of RNNs have become another popular way of doing neural language modeling in the past few years.
● This is primarily because language is sequential in nature and RNNs are specialized in working
with sequential data.
● The current word in the sentence depends on its context—the words before and after.

● However, when we model text using CNNs, this crucial fact is not taken into account. RNNs
work on the principle of using this context while learning the language representation or a model
of language.
● Hence, they’re known to work well for NLP tasks.

In LSTM there are Forget Gate, Input Gate and Output Gate.


The first step is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “Forget Gate” layer.

The second step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “Input Gate” layer decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values that could be added to the state.

Finally, we need to decide what we are going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

LSTM (Explanation of the Intuition)

Cell State

The LSTM has the ability to remove or add information to the cell state, which is regulated by
gates.

Forget gate

The first step in LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate” layer.

Current Cell State or Input Layer


The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃(t), that could be added to the state.

We multiply the old state C(t-1) by f(t), forgetting the things we decided to forget earlier. Then we add i(t)·C̃(t), the new candidate values scaled by how much we decided to update each state value. The result is the new cell state C(t).


Output Layer

First, run a sigmoid layer which decides what parts of the cell state we are going to output. Then, put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
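Putting the three gates together, the standard LSTM update equations are as follows (stated here for reference in the same plain notation; σ denotes the sigmoid function and [h(t-1), x(t)] the concatenation of the previous hidden state and the current input):

f(t) = σ(Wf·[h(t-1), x(t)] + bf)        (forget gate)
i(t) = σ(Wi·[h(t-1), x(t)] + bi)        (input gate)
C̃(t) = tanh(WC·[h(t-1), x(t)] + bC)     (candidate values)
C(t) = f(t)*C(t-1) + i(t)*C̃(t)          (new cell state)
o(t) = σ(Wo·[h(t-1), x(t)] + bo)        (output gate)
h(t) = o(t)*tanh(C(t))                   (new hidden state)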


LSTM Model for Classification

The basic LSTM model receives the words sequentially; the output (hidden state) from each cell is passed on and processed at the next time step.

Bi-LSTM Model for Classification

● It combines the power of LSTM with bidirectional processing, allowing the model to capture both the past and the future context of the input sequence.
● It has two passes: a forward pass and a backward pass.


● During the forward pass, the input sequence is fed into the forward LSTM layer from the
first time step to the last. At each time step, the forward LSTM computes its hidden state
and updates its memory cell based on the current input and the previous hidden state and
memory cell.
● The input sequence is also fed into the backward LSTM layer in reverse order, from the
last time step to the first. The backward LSTM also computes its hidden state and updates
its memory cell based on the current input and the previous hidden state and memory cell.
● Once both passes are complete, the outputs from the forward and backward LSTM cells are combined. This combination can be done by concatenation or by applying some other transformation.

Note: For certain classification applications like sentiment analysis or binary classification, we may require the final output from all the cells (capturing the entire context), which can then be passed to a softmax activation. (This slight model change can be done with either LSTM or Bi-LSTM.)
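A hedged Keras sketch of an LSTM and a Bi-LSTM classifier for such a binary task; the vocabulary size, layer sizes, and training settings are placeholder assumptions, and the training data names are hypothetical.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

# Plain LSTM classifier: the last hidden state feeds the output layer
lstm_model = Sequential([
    Embedding(input_dim=10000, output_dim=100),
    LSTM(64),
    Dense(2, activation="softmax"),
])

# Bi-LSTM classifier: forward and backward outputs are combined (concatenated by default)
bilstm_model = Sequential([
    Embedding(input_dim=10000, output_dim=100),
    Bidirectional(LSTM(64)),
    Dense(2, activation="softmax"),
])

for model in (lstm_model, bilstm_model):
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(padded_train_sequences, train_labels, epochs=5, validation_split=0.1)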


Differences among RNN, LSTM, GRU


Learning with No or Less Data

1. No Training Data
● Eg. The classifier is expected to automatically route customer complaint emails into a set of
categories: billing, delivery, and others.
● If we’re fortunate, we may discover a source of large amounts of annotated data for this task.

● If such a database doesn’t exist, where should we start to build our classifier?

⮚ In such a scenario, create an annotated dataset where customer complaints are mapped to
the set of categories mentioned above.


⮚ Or get customer service agents to manually label some of the complaints and use that as
the training data for our ML model.
⮚ Another approach is called “bootstrapping” or “weak supervision.”
2. Less Training Data:
● Active Learning and Domain Adaptation

● With human annotation or bootstrapping, it may sometimes happen that a good classification model cannot be built.
● It’s also possible that most of the requests we collected belonged to billing and very few belonged
to the other categories, resulting in a highly imbalanced dataset.
● Asking the agents to spend many hours doing manual annotation is not always feasible.

● What should we do in such scenarios?

● One approach to address such problems is active learning, which is primarily about identifying
which data points are more crucial to be used as training data.

Active Learning

1. Train the classifier with the available amount of data.

2. Start using the classifier to make predictions on new data.

3. For the data points where the classifier is very unsure of its predictions, send them to human annotators
for their correct classification.

4. Include these data points in the existing training data and retrain the model.

Repeat Steps 1 through 4 until a satisfactory model performance is reached.

Tools like Prodigy have active learning solutions implemented for text classification.
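A simplified sketch of one round of this loop using uncertainty (margin) sampling with scikit-learn; the data arguments, the margin-based uncertainty measure, and the number of queries are illustrative assumptions rather than a specific tool's API.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_round(labeled_texts, labels, unlabeled_pool, n_queries=10):
    # Step 1: train the classifier on the currently available labeled data
    vec = TfidfVectorizer()
    X = vec.fit_transform(labeled_texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    # Step 2: make predictions on the unlabeled pool
    probs = clf.predict_proba(vec.transform(unlabeled_pool))

    # Step 3: pick the points the classifier is least sure about
    # (smallest margin between the top two class probabilities)
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    query_indices = np.argsort(margin)[:n_queries]

    # Step 4: send these points to human annotators, add them to the
    # training data, retrain, and repeat until performance is satisfactory.
    return query_indices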

Domain Adaptation

● For example, if the model is trained on complaints about electronic products and we’re using it for complaints about cosmetic products, a pre-trained classifier trained on that other source data is unlikely to perform well.
● Domain adaptation is a method to address such scenarios; this is also called transfer learning.

● Here, we “transfer” what we learned from one domain (source) with large amounts of data to
another domain (target) with less labeled data but large amounts of unlabeled data.

Domain adaptation in text classification:



1. Start with a large, pre-trained language model trained on a large dataset of the source domain (e.g.,
Wikipedia data).

2. Fine-tune this model using the target domain’s unlabeled data.

3. Train a classifier on the labeled target domain data by extracting feature representations from the
fine-tuned language model from Step 2.

ULMFit is another popular domain adaptation approach for text classification.
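A hedged sketch of Step 3 (extracting feature representations from a pre-trained language model) using the Hugging Face transformers library; the model name, the mean-pooling choice, and the tiny target-domain dataset are illustrative assumptions, and Step 2 (further pre-training on unlabeled target data) is omitted here.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the last hidden states to get one fixed-size vector per text
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

# Train a classifier on the (small) labeled target-domain data
target_texts = ["lipstick arrived broken", "charged twice for the same order"]
target_labels = ["delivery", "billing"]
clf = LogisticRegression(max_iter=1000).fit(embed(target_texts), target_labels)
print(clf.predict(embed(["refund not processed yet"])))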

