NLP Transformer-Based Models Used For Sentiment Analysis
This section compares five transformer models fine-tuned for sentiment analysis: BERT,
RoBERTa, DistilBERT, ALBERT, and XLNet. Among them, DistilBERT is a smaller, faster, and
cheaper version of BERT, distilled from the larger model; it retains most of the original
BERT's performance while being significantly more efficient. A quick pipeline demo of such a
model follows.
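Before any fine-tuning, the quickest way to try a transformer sentiment model is the Hugging
Face pipeline API. The sketch below is a minimal illustration (not part of the original
notebook) that pins the library's well-known default English sentiment checkpoint, a
DistilBERT fine-tuned on SST-2; note it only distinguishes POSITIVE/NEGATIVE, unlike the
four-class dataset used later.
# Minimal sketch: zero-setup sentiment inference with the transformers pipeline,
# pinning the DistilBERT-on-SST-2 checkpoint explicitly.
from transformers import pipeline
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("I absolutely loved the new update!"))
# -> [{'label': 'POSITIVE', 'score': ...}]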
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='whitegrid')
train = pd.read_csv('/kaggle/input/sentiment-analysis-dataset/training.csv', header=None)
validation = pd.read_csv('/kaggle/input/sentiment-analysis-dataset/validation.csv', header=None)
# The CSVs have no header row, so name the columns explicitly; only 'Entity' and
# 'Sentiment' are referenced below, the other names are placeholders.
train.columns = ['ID', 'Entity', 'Sentiment', 'Text']
validation.columns = ['ID', 'Entity', 'Sentiment', 'Text']
display(train.isnull().sum())
print("*****"* 5)
display(validation.isnull().sum())
sentiment_counts_train = train['Sentiment'].value_counts()
sentiment_counts_validation = validation['Sentiment'].value_counts()
# Side-by-side pie charts; the ax1 panel for the training split is reconstructed here,
# since only the validation panel survived in the extracted snippet.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
ax1.pie(sentiment_counts_train, labels=sentiment_counts_train.index,
        autopct='%1.1f%%', colors=['gold', 'lightcoral', 'lightskyblue', '#99FF99'])
ax1.set_title('Sentiment Distribution (Training Data)', fontsize=20)
ax2.pie(sentiment_counts_validation, labels=sentiment_counts_validation.index,
        autopct='%1.1f%%', colors=['gold', 'lightcoral', 'lightskyblue', '#99FF99'])
ax2.set_title('Sentiment Distribution (Validation Data)', fontsize=20)
plt.tight_layout()
plt.show()
# Calculate the value counts of 'Entity'
entity_counts = train['Entity'].value_counts()
top_names = entity_counts.head(19).copy()
other_count = entity_counts.iloc[19:].sum()
top_names['Other'] = other_count
top_names.to_frame()
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# Convert the counts to percentages; this intermediate step did not survive the
# extraction, but it is what the 'percentages' series used below has to be.
percentages = top_names / top_names.sum() * 100
fig = go.Figure(data=[go.Pie(
    labels=percentages.index,
    values=percentages,
    textinfo='label+percent',
    insidetextorientation='radial'
)])
fig.update_layout(
title_text='Top Names with Percentages',
showlegend=False
)
fig.show()
import pandas as pd
from sklearn.model_selection import train_test_split
import plotly.graph_objects as go
# Only the layout call for this figure survived extraction; the split and the Plotly
# table below are a minimal reconstruction (split parameters assumed).
train_df, test_df = train_test_split(train, test_size=0.2, random_state=42)
fig = go.Figure(data=[go.Table(
    header=dict(values=list(test_df.columns)),
    cells=dict(values=[test_df[col].head(5) for col in test_df.columns])
)])
fig.update_layout(
    title='First 5 Rows of Test Data',
    width=1000,
    height=500,
)
fig.show()
1. BERT (Bidirectional Encoder Representations from Transformers)
BERT is a groundbreaking language model that has significantly advanced the field of Natural
Language Processing (NLP).
It stands for Bidirectional Encoder Representations from Transformers.
Key Concepts
Bidirectional: Unlike previous models that processed text sequentially (left to right or right
to left), BERT considers the entire context of a word, both preceding and following it. This
enables a deeper understanding of language nuances.
Encoder: BERT focuses on understanding the input text rather than generating new text. It
extracts meaningful representations from the input sequence.
Transformers: The underlying architecture of BERT is based on the Transformer model,
known for its efficiency in handling long sequences and capturing dependencies between
words.
How BERT Works
Pre-training: BERT is initially trained on a massive amount of text data (like Wikipedia and
BooksCorpus) using two unsupervised tasks:
Masked Language Modeling (MLM): randomly masks some of the words in the input and
trains the model to predict them from the surrounding context (a short fill-mask sketch
follows this list).
Next Sentence Prediction (NSP): Trains the model to predict whether two given
sentences are consecutive in the original document.
Fine-tuning: After pre-training, BERT can be adapted to specific NLP tasks with minimal
additional training. This is achieved by adding a task-specific output layer to the pre-trained
model.
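To make the MLM objective concrete, the short sketch below (not part of the original notebook)
runs a pre-trained BERT through the fill-mask pipeline, which predicts a masked word from both
its left and right context.
# Hedged illustration of masked language modeling: BERT fills in the [MASK]
# token using the bidirectional context of the sentence.
from transformers import pipeline
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
for prediction in fill_mask("The service was terrible, so the review was very [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")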
Advantages of BERT
Strong performance: BERT has achieved state-of-the-art results on a wide range of NLP
tasks, including question answering, text classification, named entity recognition, and
more.
Efficiency: Fine-tuning BERT for new tasks is relatively quick and requires less data compared
to training models from scratch.
Versatility: BERT can be applied to various NLP problems with minimal modifications.
Applications of BERT
Search engines: Improving search relevance and understanding user queries.
Chatbots: Enhancing natural language understanding and generating more human-like
responses.
Sentiment analysis: Accurately determining the sentiment expressed in text.
Machine translation: Improving the quality of translated text.
Text summarization: Generating concise summaries of lengthy documents.
In essence, BERT is a powerful language model that has revolutionized NLP by capturing the
bidirectional context of words and enabling efficient transfer learning for various tasks.
%%time
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from sklearn.metrics import accuracy_score, classification_report

# Torch Dataset wrapper; only fragments of this class survived extraction, so the
# class skeleton and __init__ are reconstructed here in the obvious way (max_len assumed).
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load the pre-trained BERT tokenizer and a 4-class classification head (Negative,
# Neutral, Positive, Irrelevant); this setup step was not preserved in the extract,
# so the checkpoint name is assumed.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_BERT = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_BERT.to(device)

# Set up optimizer
optimizer = AdamW(model_BERT.parameters(), lr=2e-5)
# Training loop
num_epochs = 3
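# The notebook's actual training loop was not preserved in this extract; the loop below
# is a standard fine-tuning sketch under that assumption. train_loader is assumed to be
# a DataLoader built from SentimentDataset over the training split.
for epoch in range(num_epochs):
    model_BERT.train()
    total_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model_BERT(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            labels=batch['labels'].to(device),
        )
        outputs.loss.backward()
        optimizer.step()
        total_loss += outputs.loss.item()
    print(f"Epoch {epoch + 1}/{num_epochs} - train loss: {total_loss / len(train_loader):.4f}")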
# Final evaluation
print(classification_report(test_true, test_preds,
                            target_names=['Negative', 'Neutral', 'Positive', 'Irrelevant']))
2. RoBERTa (Robustly Optimized BERT Pretraining Approach)
RoBERTa is a retrained variant of BERT from Facebook AI. It keeps BERT's architecture but
refines the pre-training recipe: more data, longer training, larger batches, dynamic masking,
and removal of the Next Sentence Prediction objective.
Benefits of RoBERTa
Improved Performance: RoBERTa consistently outperforms BERT on a wide range of NLP
tasks, achieving state-of-the-art results at the time of its release.
Efficiency: the simplified objective (no NSP) and dynamic masking make each pre-training
step more informative.
Versatility: Like BERT, RoBERTa can be fine-tuned for various NLP tasks, including text
classification, question answering, and more.
Applications
Search Engines: Enhancing search relevance and understanding user queries.
Chatbots: Improving natural language understanding and generating more human-
like responses.
Sentiment Analysis: Accurately determining the sentiment expressed in text.
Machine Translation: Enhancing the quality of translated text.
Text Summarization: Generating concise summaries of lengthy documents.
In conclusion, RoBERTa is a powerful language model that builds upon the success of BERT
by incorporating several refinements. Its improved performance and versatility make it a
popular choice for various NLP applications.
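Before fine-tuning RoBERTa on this dataset, it is worth noting that publicly released RoBERTa
checkpoints for tweet sentiment already exist. The sketch below is a hedged illustration (not
part of the original notebook) using one such Hugging Face Hub checkpoint, assumed to be
available; it predicts three labels (negative/neutral/positive) rather than the four classes
used below.
# Hedged sketch: off-the-shelf tweet sentiment with a public RoBERTa checkpoint.
from transformers import pipeline
roberta_sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
print(roberta_sentiment("Borderlands is honestly the most fun I've had all year"))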
%%time
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaTokenizer, RobertaForSequenceClassification, AdamW
from sklearn.metrics import accuracy_score, classification_report

# Same Dataset wrapper as in the BERT experiment, now fed by the RoBERTa tokenizer;
# the class skeleton is reconstructed around the surviving fragments.
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load the RoBERTa tokenizer, 4-class classification head, and optimizer; this setup
# step was not preserved in the extract, so the checkpoint name is assumed.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model_RoBERTa = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=4)
optimizer = AdamW(model_RoBERTa.parameters(), lr=2e-5)
# Training loop
num_epochs = 3
3. DistilBERT (a distilled version of BERT)
DistilBERT is a smaller and faster version of the BERT model. It is created using a technique
called knowledge distillation: a smaller model (the student) learns to mimic the behavior of a
larger, more complex model (the teacher), which in this case is BERT.
Key Features
Smaller size: DistilBERT has about 40% fewer parameters than BERT, making it more
efficient in terms of memory and computation.
Faster: it runs roughly 60% faster than BERT, making it suitable for real-time applications.
Comparable performance: despite its smaller size, DistilBERT retains about 97% of BERT's
language understanding capabilities (as measured on the GLUE benchmark).
How it Works
Knowledge Distillation: the process involves training DistilBERT to predict the same outputs
as BERT for a given input. Instead of using hard labels (the single correct answer), DistilBERT
is trained on softened output distributions from BERT, which lets the smaller model learn more
generalizable knowledge (see the loss sketch after this list).
Architecture Simplification: Some architectural elements of BERT, such as the
token type embeddings, are removed to reduce complexity.
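The "softened outputs" mentioned above are the teacher's logits divided by a temperature before
the softmax, and the student is trained to match that distribution. The sketch below is a
minimal, generic distillation loss in PyTorch, assuming only a student and a teacher producing
logits over the same classes; the real DistilBERT recipe additionally combines this with the
MLM loss and a cosine embedding loss.
# Minimal sketch of a soft-target distillation loss (not the full DistilBERT recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then penalise the KL divergence
    # between the student's log-probabilities and the teacher's probabilities.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction='batchmean') * temperature ** 2

# Toy usage with random logits for a batch of 4 examples and 10 classes
print(distillation_loss(torch.randn(4, 10), torch.randn(4, 10)))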
Advantages
Efficiency: Smaller size and faster inference speed make it suitable for resource-
constrained environments.
Cost-effective: Lower computational requirements lead to reduced training and
inference costs.
Good performance: Despite its smaller size, it maintains a high level of performance
on various NLP tasks.
Applications
Text classification: Sentiment analysis, topic modeling
Named entity recognition: Identifying entities in text (e.g., persons, organizations,
locations)
Question answering: Finding answers to questions based on given text
Text generation: Summarization, translation
In summary, DistilBERT offers a compelling balance between model size, speed, and
performance. It's a valuable tool for NLP practitioners looking to deploy models efficiently
without sacrificing accuracy.
%%time
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW
from sklearn.metrics import accuracy_score, classification_report

# Same Dataset wrapper as before, this time built around the DistilBERT tokenizer;
# the class skeleton is reconstructed around the surviving fragments.
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load the DistilBERT tokenizer, 4-class classification head, and optimizer; this setup
# step was not preserved in the extract, so the checkpoint name is assumed.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model_DistilBERT = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
optimizer = AdamW(model_DistilBERT.parameters(), lr=2e-5)
# Training loop
num_epochs = 3
torch.save(model_DistilBERT.state_dict(), 'sentiment_model_distilbert.pth')
# Final evaluation
print(classification_report(test_true, test_preds,
                            target_names=['Negative', 'Neutral', 'Positive', 'Irrelevant']))
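Since the fine-tuned DistilBERT weights are saved to sentiment_model_distilbert.pth above,
they can be reloaded later for inference. The sketch below is a standard reload pattern, with
the label order assumed to match the target_names used in the reports.
# Hedged sketch: reload the saved DistilBERT weights and classify a new text.
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

labels = ['Negative', 'Neutral', 'Positive', 'Irrelevant']  # order assumed
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=4)
model.load_state_dict(torch.load('sentiment_model_distilbert.pth', map_location='cpu'))
model.eval()
inputs = tokenizer("The new patch completely broke the game", return_tensors='pt',
                   truncation=True, padding=True)
with torch.no_grad():
    predicted_class = model(**inputs).logits.argmax(dim=-1).item()
print(labels[predicted_class])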
4. ALBERT (A Lite BERT)
ALBERT stands for "A Lite BERT", designed for self-supervised learning of language
representations. It's a language model developed by Google AI, built to be more efficient and
effective than the original BERT model.
Key Improvements Over BERT
Parameter Reduction: ALBERT drastically reduces the number of parameters compared to
BERT, making it more memory-efficient and faster to train (a sketch after this list shows a
concrete size comparison). This is achieved by:
Factorized embedding parameterization: decomposing the large vocabulary embedding matrix
into two smaller matrices, so the embedding size no longer has to match the hidden size.
Cross-layer parameter sharing: sharing parameters across the transformer layers to remove
redundancy.
Sentence-Order Prediction (SOP): Instead of the Next Sentence Prediction (NSP)
task used in BERT, ALBERT employs SOP. This task is more challenging and helps
the model better understand sentence relationships.
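As a hedged illustration of the parameter reduction described above (not part of the original
notebook), the sketch below loads the base checkpoints of both models and prints their
parameter counts: roughly 110M for bert-base-uncased versus roughly 12M for albert-base-v2.
# Compare parameter counts: factorized embeddings and cross-layer sharing shrink
# ALBERT-base to roughly a tenth of BERT-base's size.
from transformers import AlbertModel, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')
albert = AlbertModel.from_pretrained('albert-base-v2')
print(f"BERT-base parameters:   {bert.num_parameters():,}")
print(f"ALBERT-base parameters: {albert.num_parameters():,}")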
Architecture
ALBERT maintains the overall transformer architecture of BERT but incorporates the
aforementioned improvements. It consists of:
Embedding layer: Converts input tokens into numerical representations.
Transformer encoder: Processes the input sequence and captures contextual
information.
Output layer: Predicts the masked words and sentence order.
Benefits of ALBERT
Efficiency: ALBERT is significantly smaller and faster to train than BERT.
Improved Performance: Despite its smaller size, ALBERT often achieves better or
comparable performance to BERT on various NLP tasks.
Versatility: Like BERT, ALBERT can be fine-tuned for various NLP tasks.
Applications
Text classification: Sentiment analysis, topic modeling
Question answering: Answering questions based on given text
Named entity recognition: Identifying entities in text (e.g., persons, organizations,
locations)
Text summarization: Generating concise summaries of lengthy documents
In summary, ALBERT is a powerful language model that addresses some of the limitations
of BERT while maintaining its strengths. It offers a good balance between model size, speed,
and performance, making it a popular choice for various NLP applications.
%%time
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AlbertTokenizer, AlbertForSequenceClassification, AdamW
from sklearn.metrics import accuracy_score, classification_report

# Same Dataset wrapper, now around the ALBERT tokenizer; the class skeleton and
# __init__ are reconstructed around the surviving fragments.
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load the ALBERT tokenizer and 4-class classification head; this setup step was not
# preserved in the extract, so the checkpoint name is assumed.
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model_ALBERT = AlbertForSequenceClassification.from_pretrained('albert-base-v2', num_labels=4)

# Set up optimizer
optimizer = AdamW(model_ALBERT.parameters(), lr=2e-5)
# Training loop
num_epochs = 3
# Final evaluation
print(classification_report(test_true, test_preds, target_names=['Negative',
'Neutral', 'Positive', 'Irrelevant']))
5. XLNet
XLNet is a powerful language model that builds upon the successes of its predecessor, BERT,
while addressing some of its limitations. The "XL" comes from Transformer-XL, the
architecture XLNet builds on, which was designed to handle extra-long contexts.
Key Differences from BERT
Autoregressive vs. Autoencoding: while BERT is an autoencoding model trained to reconstruct
[MASK]-corrupted input, XLNet is a generalized autoregressive model: it predicts each token
from the tokens that precede it in some factorization order. By varying that order during
pre-training, XLNet captures bidirectional context without the artificial [MASK] tokens and
the pretrain/fine-tune mismatch they introduce in BERT.
Permutation Language Model: XLNet introduces the concept of a permutation
language model. Instead of training on a fixed order of tokens, it considers all possible
permutations of the input sequence. This enables the model to learn dependencies
between any two tokens in the sequence, regardless of their position.
How XLNet Works
Permutation Language Modeling: rather than physically reordering the input, XLNet samples
a random factorization order over the token positions and trains the model to predict each
target token from the tokens that precede it in that order (a small sketch after this list
illustrates these orders).
Attention Mechanism: Similar to BERT, XLNet uses a self-attention mechanism to
capture dependencies between different parts of the input sequence.
Two-Stream Self-Attention: XLNet employs two streams of self-attention:
Content stream: Focuses on the content of the tokens.
Query stream: Focuses on the position of the tokens in the permutation.
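To make the factorization orders concrete, here is a tiny illustration in plain Python (no
model involved): for a three-token sentence, each permutation of the positions defines which
tokens a given target may condition on.
# Toy illustration of permutation language modeling: each factorization order lets a
# token be predicted from the tokens that precede it in that order, not in the
# original sentence order. Positions (and their encodings) stay fixed.
from itertools import permutations

tokens = ["the", "game", "rocks"]
for order in permutations(range(len(tokens))):
    print(f"factorization order {order}:")
    for step, position in enumerate(order):
        context = [tokens[p] for p in order[:step]]
        print(f"  predict '{tokens[position]}' (position {position}) from {context}")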
Advantages of XLNet
Bidirectional Context: XLNet can capture bidirectional context more effectively
than BERT, leading to improved performance on various NLP tasks.
Flexibility: The permutation language modeling approach allows for more flexible
modeling of language.
Strong Performance: XLNet has achieved state-of-the-art results on many NLP
benchmarks.
Applications of XLNet
Text classification
Question answering
Natural language inference
Machine translation
Text summarization
In summary, XLNet is a significant advancement in the field of natural language processing,
offering improved performance and flexibility compared to previous models. Its ability to
capture bidirectional context effectively makes it a powerful tool for various NLP
applications.
%%time
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import XLNetTokenizer, XLNetForSequenceClassification, AdamW
from sklearn.metrics import accuracy_score, classification_report

# Same Dataset wrapper, this time also returning token_type_ids as in the surviving
# fragments; the class skeleton is reconstructed around them.
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load the XLNet tokenizer and 4-class classification head; this setup step was not
# preserved in the extract, so the checkpoint name is assumed.
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model_XLNet = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=4)

# Set up optimizer
optimizer = AdamW(model_XLNet.parameters(), lr=2e-5)
# Training loop
num_epochs = 3
# Final evaluation
print(classification_report(test_true, test_preds,
                            target_names=['Negative', 'Neutral', 'Positive', 'Irrelevant']))
# Assuming test_true and test_preds are defined
from sklearn.metrics import confusion_matrix
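# The confusion-matrix plot that presumably followed this import did not survive the
# extraction; a standard heatmap over the four classes would look roughly like this.
class_labels = ['Negative', 'Neutral', 'Positive', 'Irrelevant']
cm = confusion_matrix(test_true, test_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()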
# Compare model accuracies with a bar chart. The figure setup and accuracy values were
# not preserved here, so accuracy_trial_1 is assumed to hold one accuracy (%) per model.
models = ['BERT', 'RoBERTa', 'DistilBERT', 'ALBERT', 'XLNet']
fig, ax = plt.subplots(figsize=(12, 6))
# Set the width of each bar and the positions of the bars
width = 0.7
ax.bar(range(len(models)), accuracy_trial_1, width=width)
ax.set_xticks(range(len(models)))
ax.set_xticklabels(models, fontsize=14)
ax.set_ylabel('Accuracy (%)', fontsize=14)
# Add value labels on top of each bar with increased font size
for i, v in enumerate(accuracy_trial_1):
    # Adjust vertical offset and format to one decimal place
    ax.text(i, v + 0.2, f'{v:.1f}', ha='center', va='bottom', fontsize=16)
# Add gridlines
ax.grid(axis='y', linestyle='--', alpha=0.9)
plt.tight_layout()
plt.show()