NLP Module 3

SUBJECT CODE: 22AI632
BY: RACHEL E C, BITM, BALLARI
Text classification is the task of assigning one or more categories to a
given piece of text from a larger set of possible categories.
This task of categorizing texts based on some properties has a wide
range of applications across diverse domains, such as social media,
e-commerce, healthcare, law, and marketing.
Supervised classification approaches fall into three types based on the
number of categories involved:
1. Binary – 2 classes
2. Multiclass – more than 2 classes
3. Multilabel – one or more labels/classes attached to each text

Text classification is sometimes also referred to as topic classification,
text categorization, or document categorization.
APPLICATIONS

Content classification and organization

Customer support

E-commerce

Language identification

Authorship attribution

Triaging posts

Segregate fake news from real news


One typically follows these steps when building a text classification
system:
1. Collect or create a labeled dataset suitable for the task.
2. Split the dataset into two (training and test) or three parts: training,
validation (i.e., development), and test sets, then decide on evaluation
metric(s).
3. Transform raw text into feature vectors.
4. Train a classifier using the feature vectors and the corresponding
labels from the training set.
5. Using the evaluation metric(s) from Step 2, benchmark the model
performance on the test set.
6. Deploy the model to serve the real-world use case and monitor its
performance.
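A minimal sketch of Steps 1–5 with scikit-learn; the dataset file and its "text"/"label" columns are hypothetical placeholders:

```python
# Minimal sketch of the classification pipeline (Steps 1-5).
# The CSV file name and its "text"/"label" columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

# Step 1: collect or create a labeled dataset
df = pd.read_csv("tickets.csv")                       # columns: text, label

# Step 2: split into training and test sets; choose evaluation metrics
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# Step 3: transform raw text into feature vectors
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 4: train a classifier using the feature vectors and labels
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# Step 5: benchmark on the test set with the chosen metrics
pred = clf.predict(X_test_vec)
print("accuracy:", accuracy_score(y_test, pred))
print("macro F1:", f1_score(y_test, pred, average="macro"))
```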
A Simple Classifier
Lexicon – based Sentiment Analysis
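A lexicon-based sentiment classifier needs no training: it simply counts positive and negative words from a hand-built lexicon and compares the counts. A toy sketch (the tiny word lists below are illustrative, not a real lexicon):

```python
# Lexicon-based sentiment analysis: count positive vs. negative words.
# The small word sets below are illustrative only, not a real lexicon.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "poor", "sad", "hate"}

def lexicon_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The movie was great and the acting was excellent"))
```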
Bayes’ Theorem is used to determine the conditional probability of an
event: the probability of an event based on prior knowledge of
conditions that might be related to that event.
Pr (A | B) = Pr (B | A) Pr (A) / Pr (B)
The Naive Bayes classifier additionally assumes that features are
conditionally independent given the class:
Pr (A ∩ B | C) = Pr (A | C) Pr (B | C)
Types of Naïve Bayes Model

Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values
instead of discrete ones, the model assumes these values are sampled
from a Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the
data is multinomially distributed. It is primarily used for document
classification problems, i.e., determining which category a particular
document belongs to, such as sports, politics, or education.
The classifier uses the frequencies of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean
variables, such as whether or not a particular word is present in a
document. This model is also popular for document classification tasks.
The Naive Bayes classifier learns the probability of a text belonging to
each class and chooses the one with the maximum probability. Such a
classifier is called a generative classifier.
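A minimal sketch contrasting the Multinomial and Bernoulli variants with scikit-learn; the toy documents and labels are illustrative:

```python
# Multinomial NB uses word frequencies; Bernoulli NB uses binary presence/absence.
# The toy documents and labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["the team won the match", "parliament passed the bill",
        "the striker scored twice", "the senate debated the law"]
labels = ["sports", "politics", "sports", "politics"]

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)                    # word-frequency features
mnb = MultinomialNB().fit(X_counts, labels)

bvec = CountVectorizer(binary=True)
X_bool = bvec.fit_transform(docs)                     # word presence/absence features
bnb = BernoulliNB().fit(X_bool, labels)

test = ["the striker won the match"]
print(mnb.predict(vec.transform(test)))               # expected: ['sports']
print(bnb.predict(bvec.transform(test)))              # expected: ['sports']
```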
Logistic regression is an example of a discriminative classifier; it is
commonly used as a baseline in research and as an MVP in real-world
industry scenarios.
While Naive Bayes estimates probabilities based on feature occurrence
in classes, logistic regression “learns” the weights for individual
features based on how important they are to the classification decision.
The goal of logistic regression is to learn a linear separator between
classes in the training data with the aim of maximizing the probability
of the data.
This “learning” of feature weights and of the probability distribution
over all classes is done through a function called the “logistic”
function, hence the name logistic regression.
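A minimal sketch of a logistic regression text classifier with scikit-learn, reusing the hypothetical X_train/y_train split from the pipeline sketch above:

```python
# Logistic regression over TF-IDF features; X_train/y_train are the
# hypothetical raw-text split from the earlier pipeline sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lr_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
lr_clf.fit(X_train, y_train)                  # learns one weight per feature, per class
print(lr_clf.predict_proba(X_test.iloc[:3]))  # probability distribution over classes
```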
Support Vector Machine
• A support vector machine (SVM), first invented in the early 1960s, is a
discriminative classifier.
• SVM is a powerful machine learning algorithm used for linear or
nonlinear classification, regression, and even outlier detection tasks.
• It looks for an optimal hyperplane in a higher-dimensional space that
can separate the classes in the data by the maximum possible margin.
• SVMs are capable of learning even non-linear separations between
classes.
The main objective of the SVM algorithm is to find the optimal
hyperplane in an N-dimensional space that separates the data points of
different classes in the feature space.
The hyperplane is chosen so that the margin between the closest points
of different classes is as large as possible.
If the number of input features is two, the hyperplane is just a line;
if the number of input features is three, the hyperplane becomes a
2-D plane.
The best hyperplane is the one that represents the largest separation,
or margin, between the two classes.
It is the hyperplane whose distance to the nearest data point on each
side is maximized; if such a hyperplane exists, it is known as the
maximum-margin hyperplane (hard margin).
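A minimal sketch of a linear SVM text classifier with scikit-learn, again reusing the hypothetical X_train/y_train split from the pipeline sketch:

```python
# Linear SVM (maximum-margin separator) over TF-IDF features.
# X_train/y_train are the hypothetical split from the pipeline sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_clf.fit(X_train, y_train)                 # finds the maximum-margin hyperplane
print(svm_clf.predict(X_test.iloc[:3]))       # predicted class labels
```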
Deep Learning for Text Classification
The steps involved in converting training and test data into a format
suitable for the neural network input layers:

1. Tokenize the texts and convert them into word index vectors.
2. Pad the text sequences so that all text vectors are of the same
length.
3. Map every word index to an embedding vector.
We do that by multiplying word index vectors with the embedding
matrix.
The embedding matrix can either be populated using pre-trained
embeddings or it can be trained for embeddings on this corpus.
4. Use the output from Step 3 as the input to a neural network
architecture.
The code snippet below illustrates Steps 1 and 2:
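A minimal sketch of what such a snippet might look like, using the Keras preprocessing utilities; the vocabulary size and sequence length are arbitrary illustrative values, and train_texts/test_texts are assumed to be lists of raw strings:

```python
# Steps 1 and 2: tokenize texts into word-index vectors, then pad them
# to a fixed length. train_texts / test_texts are assumed lists of strings;
# MAX_NUM_WORDS and MAX_SEQUENCE_LENGTH are arbitrary illustrative values.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_texts)                     # build the vocabulary
train_seqs = tokenizer.texts_to_sequences(train_texts)  # word-index vectors
test_seqs = tokenizer.texts_to_sequences(test_texts)

x_train = pad_sequences(train_seqs, maxlen=MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(test_seqs, maxlen=MAX_SEQUENCE_LENGTH)
word_index = tokenizer.word_index                       # word -> index mapping
```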
Step 3: To use pre-trained embeddings (e.g., GloVe), we have to
download them and use them to convert our data into the input format
for the neural networks.
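A sketch of Step 3, building an embedding matrix from pre-trained GloVe vectors; the file path, embedding dimension, and the word_index from the tokenizer above are assumptions:

```python
# Step 3: map every word index to a pre-trained GloVe vector.
# The GloVe file path is a placeholder for a separately downloaded file;
# word_index, MAX_NUM_WORDS, MAX_SEQUENCE_LENGTH come from the sketch above.
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

EMBEDDING_DIM = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Row i of the matrix holds the vector for the word with index i.
embedding_matrix = np.zeros((MAX_NUM_WORDS, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < MAX_NUM_WORDS and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

embedding_layer = Embedding(MAX_NUM_WORDS, EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)            # keep GloVe vectors fixed
```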
Step 4: DL architectures consist of an input layer, an output layer, and
several hidden layers in between the two. Depending on the
architecture, different hidden layers are used. The input layer
for textual input is typically an embedding layer. The output
layer, especially in the context of text classification, is a
softmax layer with categorical output.
CNNs for Text Classification
CNNs typically consist of a series of convolution and pooling layers as
the hidden layers.
CNNs can be thought of as learning the most useful bag-of-words/n-
grams features instead of taking the entire collection of words/n-
grams as features.
Word Embeddings: Each word in the text is represented as a dense
vector (embedding), often using pre-trained embeddings like
Word2Vec, GloVe, or contextual embeddings from models like BERT.

Filters/Kernels: The CNN applies convolutional filters (or kernels)
across the word embeddings. Each filter slides over the matrix of
word embeddings (representing the text) and performs convolution
operations. These filters detect local patterns or features in the text,
such as phrases or combinations of words that may be indicative of
certain classes.

Feature Maps: The result of applying a filter is a feature map, which
captures specific patterns or features from the text. For instance, a
filter might be tuned to recognize the presence of negations or
sentiment-laden phrases.
Max Pooling: After the convolution operation, a pooling layer is often
used to reduce the dimensionality of the feature maps and retain the
most important features. Max pooling, a common technique, involves
taking the maximum value from a set of features within a specified
window. This operation helps in capturing the most prominent features
and reduces the spatial size of the feature maps.

Global Max Pooling: For text classification, global max pooling might be
used to condense the entire feature map into a single vector by taking
the maximum value across all positions. This vector represents the most
salient features extracted by the convolutional layers.
Dense Layers: After pooling, the resulting feature vector is passed
through one or more fully connected (dense) layers. These layers are
responsible for combining the extracted features and making the final
classification decision.

Activation Function: Typically, the final layer uses an activation
function such as softmax (for multi-class classification) or sigmoid (for
binary classification) to produce the probability scores for each class.

Class Prediction: The output layer provides the final prediction, which
is usually a probability distribution over the possible classes. For
instance, if you’re classifying movie reviews as positive or negative,
the network will output probabilities indicating how likely the review
is to belong to each class.
[Figure: a deep neural network with an input layer, hidden layers, and an output layer]
Specifying the model involves choices such as activation functions,
hidden layers, layer sizes, loss function, optimizer, metrics, epochs,
and batch size.
We set the number of epochs to 10 or above, but that also increases
the amount of time it takes to train the model.
Another thing to note is that, if you want to train an embedding layer
instead of using pre-trained embeddings in this model, the only thing
that changes is the line cnnmodel.add(embedding_layer).
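A sketch of such a CNN model in Keras, assuming the embedding_layer built in Step 3 and the padded x_train from Steps 1–2; the filter sizes, layer widths, and two output classes are illustrative choices:

```python
# 1-D CNN text classifier sketch; layer sizes, filter widths, and the
# number of classes (2) are illustrative. y_train is assumed to hold
# integer class ids; embedding_layer and x_train come from the steps above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

cnnmodel = Sequential()
cnnmodel.add(embedding_layer)                     # input layer: word embeddings
cnnmodel.add(Conv1D(128, 5, activation="relu"))   # filters slide over the embeddings
cnnmodel.add(MaxPooling1D(5))                     # keep the strongest local features
cnnmodel.add(Conv1D(128, 5, activation="relu"))
cnnmodel.add(GlobalMaxPooling1D())                # condense each feature map to one value
cnnmodel.add(Dense(128, activation="relu"))       # dense layer combines the features
cnnmodel.add(Dense(2, activation="softmax"))      # class probabilities

cnnmodel.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam", metrics=["accuracy"])
cnnmodel.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)
```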
Recurrent Neural Network (RNN)
The recurrent connection enables RNNs to maintain an internal memory:
the output of each step is fed back as an input to the next step,
allowing the network to capture information from previous steps and use
it in the current step. This enables the model to learn temporal
dependencies and handle inputs of variable length.
LSTMs for Text Classification

• Language is sequential in nature, and RNNs are specialized in working
with sequential data.
• The current word in a sentence depends on its context: the words
before and after it.
• RNNs work on the principle of using this context while learning the
language representation or a model of language.
• Long Short-Term Memory (LSTM) is a special kind of Recurrent Neural
Network (RNN), capable of learning long-term dependencies.
• These long-term dependencies have a great influence on the meaning
and overall polarity of a document.
• Long short-term memory networks (LSTMs) address this long-term
dependency problem by introducing a memory into the network.
• LSTM networks are designed to handle the vanishing gradient problem
and learn long-term dependencies better than traditional RNNs.
• LSTM was first introduced by Hochreiter & Schmidhuber.
• The LSTM architecture has a repeated module for each time step, as
in a standard RNN.
At each time step, the output of the LSTM module is controlled by a
set of gates, as a function of the old hidden state h_{t−1} and the
input at the current time step x_t: the forget gate f_t, the input
gate i_t, and the output gate o_t.
These gates collectively decide how to update the current memory
cell C_t and the current hidden state h_t.
The LSTM transition functions are defined as follows:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
h_t = o_t ∗ tanh(C_t)

Here σ is the sigmoid function and ∗ denotes element-wise multiplication.
Long Short-Term Memory (LSTM) was designed to overcome the
problems of the simple Recurrent Neural Network (RNN) by allowing the
network to store data in a sort of memory that it can access at a
later time.
The key to the LSTM model is the cell state.
The cell state is updated twice, with few computations, which helps
stabilize the gradients.
It also has a hidden state that acts like a short-term memory.
The first step is to decide what information we’re going to throw away
from the cell state. This decision is made by a sigmoid layer called the
“Forget Gate” layer.
The second step is to decide what new information we’re going to
store in the cell state. This has two parts.
First, a sigmoid layer called the “Input Gate” layer decides which
values we’ll update.
Next, a tanh layer creates a vector of new candidate values that could
be added to the state.
Finally, we need to decide what we are going to give as output. This
output will be based on our cell state, but will be a filtered version.
First, we run a sigmoid layer which decides what parts of the cell state
we’re going to give as output.
Then, we put the cell state through tanh (to push the values to be
between −1 and 1) and multiply it by the output of the sigmoid gate,
so that we only output the parts we decided to.
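A sketch of an LSTM text classifier in Keras, reusing the padded x_train/x_test from Steps 1–2; the embedding size, LSTM units, and dropout rates are illustrative, and the labels are assumed to be binary (0/1):

```python
# LSTM text classifier sketch; layer sizes and dropout rates are illustrative.
# x_train/x_test come from the earlier preprocessing, and y_train/y_test are
# assumed binary (0/1) labels.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))                  # trainable embedding layer
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # gated memory over the sequence
rnnmodel.add(Dense(1, activation="sigmoid"))                 # binary classification output

rnnmodel.compile(loss="binary_crossentropy", optimizer="adam",
                 metrics=["accuracy"])
rnnmodel.fit(x_train, y_train, epochs=5, batch_size=32,
             validation_data=(x_test, y_test))
```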
Case Study: Corporate Ticketing
Imagine we’re asked to build a ticketing system for our organization
that will track all the tickets or issues people face in the organization
and route them to either internal or external agents.
Now let’s say our company has recently hired a medical counsel and
partnered with a hospital.
So our system should also be able to pinpoint any medical-related issue
and route it to the relevant people and teams.

1. Use existing APIs or libraries
2. Use public datasets
3. Utilize weak supervision
4. Active learning
5. Learning from implicit and explicit feedback
Phase 1: Initial Data Collection and Model

At the beginning of the project, there is no labeled data available to
train a text classification model.
The company needs a way to generate an initial dataset to kickstart the
model-building process.

1. Map Public API or Library


The team searches for public APIs or libraries that can provide relevant
data.

2. Map Public Dataset


Another approach is to find existing public datasets that are similar to
the corporate environment, such as datasets containing labeled
customer service tickets or product reviews. These datasets can
provide a base for understanding how to classify tickets.
3. Weak Supervision to Create Initial Dataset
Weak supervision involves using less accurate, noisy, or heuristic-based
methods to label the initial dataset.
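Weak supervision can be as simple as a handful of keyword heuristics that assign noisy labels to unlabeled tickets; a toy sketch in which the keyword lists and routing classes are hypothetical:

```python
# Toy weak-supervision sketch: heuristic keyword rules produce noisy labels
# for unlabeled tickets. The keyword sets and class names are hypothetical.
MEDICAL_KEYWORDS = {"doctor", "hospital", "medicine", "injury", "clinic"}
IT_KEYWORDS = {"laptop", "password", "vpn", "email", "printer"}

def weak_label(ticket: str) -> str:
    words = set(ticket.lower().split())
    if words & MEDICAL_KEYWORDS:
        return "medical"
    if words & IT_KEYWORDS:
        return "it_support"
    return "other"

tickets = ["My laptop will not connect to the vpn",
           "Need an appointment with the company doctor"]
noisy_labels = [weak_label(t) for t in tickets]   # seed labels for a first model
print(noisy_labels)                               # ['it_support', 'medical']
```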

4. Build Model
Using the initial dataset, the team builds a basic text classification
model. This model will likely be simple and less accurate but will serve
as a foundation for further development.

Phase 2: Improved Model with Continuous Iteration

5. Collect Explicit & Implicit Data


As the ticketing system is deployed, it collects explicit data (e.g., direct
feedback from users categorizing tickets) and implicit data (e.g.,
patterns in how tickets are resolved). This data is used to refine the
model.
6. Active Learning
Active learning involves the model selecting the most uncertain or
challenging cases and presenting them to human experts for labeling.
By focusing on the most difficult tickets, the model learns more
efficiently and improves its accuracy over time.
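A minimal uncertainty-sampling sketch for active learning, assuming a fitted classifier with predict_proba and a feature matrix for the still-unlabeled tickets (both hypothetical):

```python
# Uncertainty sampling: pick the tickets the current model is least sure about.
# `model` is any fitted classifier with predict_proba; `unlabeled_pool` is a
# feature matrix for tickets without labels (both hypothetical).
import numpy as np

def select_most_uncertain(model, unlabeled_pool, n=10):
    """Return indices of the n tickets the model is least confident about."""
    probs = model.predict_proba(unlabeled_pool)
    confidence = probs.max(axis=1)         # highest class probability per ticket
    return np.argsort(confidence)[:n]      # lowest confidence = most uncertain

# These tickets are sent to human experts for labeling, added to the
# training set, and the model is retrained.
```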

7. Analyze & Iterate


The team continually analyzes the model's performance, identifying
areas for improvement.
They iterate on the model, retraining it with newly collected and more
accurately labeled data.
This feedback loop ensures that the model becomes increasingly
reliable and effective at classifying tickets.
