Sentiment Classification Using BERT
BERT stands for Bidirectional Encoder Representations from Transformers and was proposed by researchers at Google AI Language in 2018. Although its original aim was to improve the understanding of the meaning of queries in Google Search, BERT has become one of the most important and versatile architectures for natural language tasks, producing state-of-the-art results on sentence-pair classification, question answering, and more.
Bidirectional Encoder Representations from Transformers (BERT)
BERT is a powerful technique for natural language processing that improves how well computers comprehend human language. Its foundation is the idea of exploiting bidirectional context to acquire rich and insightful word and phrase representations. By examining both sides of a word's context simultaneously, BERT captures a word's full meaning in context, in contrast to earlier models that considered only the left or the right context. This enables BERT to handle ambiguous and complex linguistic phenomena such as polysemy, co-reference, and long-distance dependencies.
The paper also proposed task-specific architectures built on top of BERT. In this post, we will use BERT for a sentiment classification task, specifically the single-sentence classification setup used for the CoLA (Corpus of Linguistic Acceptability) binary classification task.
Single Sentence Classification Task
BERT was released in two versions:
- BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
- BERT (LARGE): 24 layers of encoder stack with 24 bidirectional self-attention heads and 1024 hidden units.
For the TensorFlow implementation, Google provides two variants of both BERT BASE and BERT LARGE: Uncased and Cased. In the uncased version, the text is lowercased before WordPiece tokenization.
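As a quick, optional check (not part of the original tutorial code), the snippet below contrasts the uncased and cased tokenizers; the exact subword splits depend on each checkpoint's vocabulary.
Python
from transformers import BertTokenizer

# Illustrative sketch: the uncased tokenizer lowercases text before WordPiece,
# while the cased tokenizer preserves the original capitalization.
uncased_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
cased_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

sample = "GeeksforGeeks makes BERT easy"
print(uncased_tokenizer.tokenize(sample))  # all-lowercase wordpieces
print(cased_tokenizer.tokenize(sample))    # capitalization preserved, so the splits differ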
Sentiment Classification Using BERT:
Step 1: Import the necessary libraries
Python
import os
import shutil
import tarfile
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import pandas as pd
from bs4 import BeautifulSoup
import re
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Step 2: Load the dataset
Python
# Get the current working directory
current_folder = os.getcwd()
dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    cache_dir=current_folder,
    extract=True)
Output
Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 [==============================] - 12s 0us/step
Check the dataset folder
Python
dataset_path = os.path.dirname(dataset)
# Check the dataset
os.listdir(dataset_path)
Output:
['aclImdb.tar.gz', 'aclImdb']
Check the 'aclImdb' directory
Python
# Dataset directory
dataset_dir = os.path.join(dataset_path, 'aclImdb')
# Check the Dataset directory
os.listdir(dataset_dir)
Output:
['README', 'test', 'imdb.vocab', 'imdbEr.txt', 'train']
Check the 'train' dataset folder
Python
train_dir = os.path.join(dataset_dir,'train')
os.listdir(train_dir)
Output:
['urls_pos.txt',
'urls_neg.txt',
'labeledBow.feat',
'neg',
'unsup',
'unsupBow.feat',
'urls_unsup.txt',
'pos']
Read the files in the 'train' directory
Python
for file in os.listdir(train_dir):
    file_path = os.path.join(train_dir, file)
    # Check if it's a file (not a directory)
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            first_value = f.readline().strip()
            print(f"{file}: {first_value}")
    else:
        print(f"{file}: {file_path}")
Output:
urls_pos.txt: https://www.imdb.com/title/tt0453418/usercomments
urls_neg.txt: https://www.imdb.com/title/tt0064354/usercomments
labeledBow.feat: 9 0:9 1:1 2:4 3:4 4:6 5:4 6:2 7:2 8:4 10:4 12:2 26:1 27:1 28:1 29:2 32:1 41:1 45:1 47:1 50:1 54:2 57:1 59:1 63:2 64:1 66:1 68:2 70:1 72:1 78:1 100:1 106:1 116:1 122:1 125:1 136:1 140:1 142:1 150:1 167:1 183:1 201:1 207:1 208:1 213:1 217:1 230:1 255:1 321:5 343:1 357:1 370:1 390:2 468:1 514:1 571:1 619:1 671:1 766:1 877:1 1057:1 1179:1 1192:1 1402:2 1416:1 1477:2 1940:1 1941:1 2096:1 2243:1 2285:1 2379:1 2934:1 2938:1 3520:1 3647:1 4938:1 5138:4 5715:1 5726:1 5731:1 5812:1 8319:1 8567:1 10480:1 14239:1 20604:1 22409:4 24551:1 47304:1
neg: /content/datasets/aclImdb/train/neg
unsup: /content/datasets/aclImdb/train/unsup
unsupBow.feat: 0 0:8 1:6 3:5 4:2 5:1 7:1 8:5 9:2 10:1 11:2 13:3 16:1 17:1 18:1 19:1 22:3 24:1 26:3 28:1 30:1 31:1 35:2 36:1 39:2 40:1 41:2 46:2 47:1 48:1 52:1 63:1 67:1 68:1 74:1 81:1 83:1 87:1 104:1 105:1 112:1 117:1 131:1 151:1 155:1 170:1 198:1 225:1 226:1 288:2 291:1 320:1 331:1 342:1 364:1 374:1 384:2 385:1 407:1 437:1 441:1 465:1 468:1 470:1 519:1 595:1 615:1 650:1 692:1 851:1 937:1 940:1 1100:1 1264:1 1297:1 1317:1 1514:1 1728:1 1793:1 1948:1 2088:1 2257:1 2358:1 2584:2 2645:1 2735:1 3050:1 4297:1 5385:1 5858:1 7382:1 7767:1 7773:1 9306:1 10413:1 11881:1 15907:1 18613:1 18877:1 25479:1
urls_unsup.txt: https://www.imdb.com/title/tt0018515/usercomments
pos: /content/datasets/aclImdb/train/pos
Load the movie reviews and convert them into a pandas DataFrame with their respective sentiments.
Here 0 means Negative and 1 means Positive.
Python
def load_dataset(directory):
    data = {"sentence": [], "sentiment": []}
    for file_name in os.listdir(directory):
        print(file_name)
        if file_name == 'pos':
            positive_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(positive_dir):
                text = os.path.join(positive_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(1)
        elif file_name == 'neg':
            negative_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(negative_dir):
                text = os.path.join(negative_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(0)
    return pd.DataFrame.from_dict(data)
Load the training datasets
Python
# Load the dataset from the train_dir
train_df = load_dataset(train_dir)
print(train_df.head())
Output:
urls_pos.txt
urls_neg.txt
labeledBow.feat
neg
unsup
unsupBow.feat
urls_unsup.txt
pos
sentence sentiment
0 When I rented this movie, I had very low expec... 0
1 'Major Payne' is a film about a major who make... 0
2 I'd been following this films progress for qui... 0
3 Although the beginning suggests All Quiet on t... 0
4 Cabin Fever is the first feature film directed... 0
Load the test dataset
Python
test_dir = os.path.join(dataset_dir,'test')
# Load the dataset from the test_dir
test_df = load_dataset(test_dir)
print(test_df.head())
Output:
urls_pos.txt
urls_neg.txt
labeledBow.feat
neg
pos
sentence sentiment
0 The movie is nothing extraordinary. As a matte... 0
1 Rented the video with a lot of expectations, b... 0
2 The first time I saw a commercial for this sho... 0
3 We can conclude that there are 10 types of peo... 0
4 I seem to remember a lot of hype about this mo... 0
Step 3: Preprocessing
Python
sentiment_counts = train_df['sentiment'].value_counts()

# Map the numeric sentiment labels to readable names for the x-axis
fig = px.bar(x=sentiment_counts.index.map({0: 'Negative', 1: 'Positive'}),
             y=sentiment_counts.values,
             color=sentiment_counts.index,
             color_discrete_sequence=px.colors.qualitative.Dark24,
             title='Sentiment Counts')
fig.update_layout(title='Sentiment Counts',
                  xaxis_title='Sentiment',
                  yaxis_title='Counts',
                  template='plotly_dark')
# Show the bar chart
fig.show()
pyo.plot(fig, filename='Sentiment Counts.html', auto_open=True)
Output:
Sentiment Counts bar chart
Text Cleaning
Python
def text_cleaning(text):
    # Strip HTML tags, then remove bracketed text and
    # any character other than letters, digits, whitespace, commas and apostrophes
    soup = BeautifulSoup(text, "html.parser")
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    pattern = r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)
    return text
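For example, on a small made-up snippet (hypothetical input, shown only to illustrate what the helper strips out), the cleaning behaves roughly like this:
Python
sample = "<br />I loved it! [contains spoilers] 10/10 -- would watch again."
print(text_cleaning(sample))
# Roughly: "I loved it  1010  would watch again" (HTML tags, bracketed text
# and punctuation other than commas and apostrophes are removed)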
Apply text_cleaning
Python
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning).tolist()
# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)
Plot the reviews as word clouds
Python
# Function to generate word cloud
def generate_wordcloud(text, Title):
    all_text = " ".join(text)
    wordcloud = WordCloud(width=800,
                          height=400,
                          stopwords=set(STOPWORDS),
                          background_color='black').generate(all_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(Title)
    plt.show()
Positive Reviews
Python
positive = train_df[train_df['sentiment']==1]['Cleaned_sentence'].tolist()
generate_wordcloud(positive,'Positive Review')
Output:
Positive Reviews WordCloud
Negative Reviews
Python
negative = train_df[train_df['sentiment']==0]['Cleaned_sentence'].tolist()
generate_wordcloud(negative,'Negative Review')
Output:
Negative Reviews WordCloud
Separate the input text and target sentiment for both the train and test sets
Python
# Training data
#Reviews = "[CLS] " +train_df['Cleaned_sentence'] + "[SEP]"
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']
# Test data
#test_reviews = "[CLS] " +test_df['Cleaned_sentence'] + "[SEP]"
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']
Split the test data into test and validation sets
Python
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
                                                test_targets,
                                                test_size=0.5,
                                                stratify=test_targets)
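An optional sanity check (not in the original article) to confirm the stratified split kept both halves balanced:
Python
# Both halves should contain roughly equal numbers of positive and negative reviews
print(y_val.value_counts())
print(y_test.value_counts())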
Step 4: Tokenization & Encoding
BERT tokenization converts raw text into the numerical inputs that can be fed into the BERT model. It tokenizes the text and performs some preprocessing to prepare it for the model's input format. Let's look at some key features of BERT tokenization.
- The BERT tokenizer splits words into subwords, or wordpieces. For example, the word "geeksforgeeks" can be split into "geeks", "##for", and "##geeks". The "##" prefix indicates that the subword is a continuation of the previous one. This reduces the vocabulary size and helps the model deal with rare or unknown words.
- The BERT tokenizer adds special tokens like [CLS], [SEP], and [MASK] to the sequence. These tokens have special meanings:
- [CLS] is used for classifications and to represent the entire input in the case of sentiment analysis,
- [SEP] is used as a separator i.e. to mark the boundaries between different sentences or segments,
- [MASK] is used for masking i.e. to hide some tokens from the model during pre-training.
- The BERT tokenizer returns three components as outputs (see the short sketch after this list):
- input_ids: The numerical identifiers of the vocabulary tokens
- token_type_ids: It identifies which segment or sentence each token belongs to.
- attention_mask: It flags that inform the model which tokens to pay attention to and which to disregard.
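A brief, self-contained sketch of what these outputs look like for a single sentence (illustrative only; the tutorial loads the tokenizer properly in the next step, and the exact ids depend on the checkpoint's vocabulary):
Python
from transformers import BertTokenizer

demo_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
enc = demo_tokenizer.encode_plus("geeksforgeeks is great",
                                 padding='max_length',
                                 truncation=True,
                                 max_length=10,
                                 return_tensors='tf')
print(enc['input_ids'])       # vocabulary ids: [CLS] ... [SEP], then padding ids if any
print(enc['token_type_ids'])  # all zeros for a single-sentence input
print(enc['attention_mask'])  # 1 for real tokens, 0 for padding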
Load the pre-trained BERT tokenizer
Python
#Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
Apply BERT tokenization to the training, testing and validation datasets
Python
max_len = 128

# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
                                              padding=True,
                                              truncation=True,
                                              max_length=max_len,
                                              return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(),
                                            padding=True,
                                            truncation=True,
                                            max_length=max_len,
                                            return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(),
                                             padding=True,
                                             truncation=True,
                                             max_length=max_len,
                                             return_tensors='tf')
Check the encoded dataset
Python
k = 0
print('Training Comments -->>',Reviews[k])
print('\nInput Ids -->>\n',X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n',tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n',X_train_encoded['attention_mask'][k])
print('\nLabels -->>',Target[k])
Output:
Training Comments -->> When I rented this movie, I had very low expectationsbut when I saw it, I realized that the movie was less a lot less than what I expected The actors were bad the doctor's wife was one of the worst, the story was so stupidit could work for a Disney movie except for the murders, but this one is not a comedy, it is a laughable masterpiece of stupidity The title is well chosen except for one thing they could add stupid movie after Dead Husbands I give it 0 and a half out of 5
Input Ids -->>
tf.Tensor(
[ 101 2043 1045 12524 2023 3185 1010 1045 2018 2200 2659 10908
8569 2102 2043 1045 2387 2009 1010 1045 3651 2008 1996 3185
2001 2625 1037 2843 2625 2084 2054 1045 3517 1996 5889 2020
2919 1996 3460 1005 1055 2564 2001 2028 1997 1996 5409 1010
1996 2466 2001 2061 5236 4183 2071 2147 2005 1037 6373 3185
3272 2005 1996 9916 1010 2021 2023 2028 2003 2025 1037 4038
1010 2009 2003 1037 4756 3085 17743 1997 28072 1996 2516 2003
2092 4217 3272 2005 2028 2518 2027 2071 5587 5236 3185 2044
2757 19089 1045 2507 2009 1014 1998 1037 2431 2041 1997 1019
102 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Decoded Ids -->>
[CLS] when i rented this movie, i had very low expectationsbut when i saw it, i realized that the movie was less a lot less than what i expected the actors were bad the doctor's wife was one of the worst, the story was so stupidit could work for a disney movie except for the murders, but this one is not a comedy, it is a laughable masterpiece of stupidity the title is well chosen except for one thing they could add stupid movie after dead husbands i give it 0 and a half out of 5 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Attention Mask -->>
tf.Tensor(
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)
Labels -->> 0
Step 5: Build the classification model
Load the model
Python
# Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
Output:
model.safetensors: 100% ------------------ 440M/440M [00:07<00:00, 114MB/s]
All PyTorch model weights were used when initializing TFBertForSequenceClassification.
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
If the task at hand is similar to the one on which the checkpoint was trained, TFBertForSequenceClassification can provide predictions without further training. Here, however, the classification head is newly initialized (as the warning above notes), so we fine-tune the model on the IMDB reviews.
Compile the model
Python
# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Train the model
Python
# Step 5: Train the model
history = model.fit(
[X_train_encoded['input_ids'], X_train_encoded['token_type_ids'], X_train_encoded['attention_mask']],
Target,
validation_data=(
[X_val_encoded['input_ids'], X_val_encoded['token_type_ids'], X_val_encoded['attention_mask']],y_val),
batch_size=32,
epochs=3
)
Output:
Epoch 1/3
782/782 [==============================] - 808s 980ms/step - loss: 0.3348 - accuracy: 0.8480 - val_loss: 0.2891 - val_accuracy: 0.8764
Epoch 2/3
782/782 [==============================] - 765s 979ms/step - loss: 0.1963 - accuracy: 0.9238 - val_loss: 0.2984 - val_accuracy: 0.8906
Epoch 3/3
782/782 [==============================] - 764s 978ms/step - loss: 0.1007 - accuracy: 0.9632 - val_loss: 0.3652 - val_accuracy: 0.8816
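Optionally, since matplotlib is already imported, the training history can be visualised with a short sketch like this (not in the original article; the key names follow Keras' History object for the 'accuracy' metric compiled above):
Python
# Plot training vs. validation accuracy and loss per epoch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['accuracy'], label='train accuracy')
ax1.plot(history.history['val_accuracy'], label='val accuracy')
ax1.set_xlabel('Epoch')
ax1.legend()
ax2.plot(history.history['loss'], label='train loss')
ax2.plot(history.history['val_loss'], label='val loss')
ax2.set_xlabel('Epoch')
ax2.legend()
plt.show()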
Step 6: Evaluate the model
Python
#Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
[X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']],
y_test
)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')
Output:
391/391 [==============================] - 106s 271ms/step - loss: 0.3560 - accuracy: 0.8798
Test loss: 0.3560144007205963, Test accuracy: 0.8797600269317627
Save the model and tokenizer to the local folder
Python
path = '/content'
# Save tokenizer
tokenizer.save_pretrained(path +'/Tokenizer')
# Save model
model.save_pretrained(path +'/Model')
Load the model and tokenizer from the local folder
Python
# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path +'/Tokenizer')
# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path +'/Model')
Predict the sentiment of the test dataset
Python
pred = bert_model.predict(
[X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']])
# pred is of type TFSequenceClassifierOutput
logits = pred.logits
# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)
# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()
label = {
1: 'positive',
0: 'Negative'
}
# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]
print('Predicted Label :', pred_labels[:10])
print('Actual Label :', Actual[:10])
Output:
391/391 [==============================] - 108s 270ms/step
Predicted Label : ['positive', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'Negative', 'positive', 'Negative', 'Negative']
Actual Label : ['positive', 'Negative', 'Negative', 'Negative', 'Negative', 'positive', 'Negative', 'positive', 'Negative', 'Negative']
Classification Report
Python
print("Classification Report: \n", classification_report(Actual, pred_labels))
Output:
Classification Report:
precision recall f1-score support
Negative 0.87 0.90 0.88 6250
positive 0.90 0.86 0.88 6250
accuracy 0.88 12500
macro avg 0.88 0.88 0.88 12500
weighted avg 0.88 0.88 0.88 12500
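A confusion matrix (optional; scikit-learn is already available for the report above) gives a complementary view of where the classifier makes mistakes:
Python
from sklearn.metrics import confusion_matrix

# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(Actual, pred_labels, labels=['Negative', 'positive'])
print(cm)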
Step 7: Prediction with user inputs
Python
def Get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Convert Review to a list if it's not already a list
    if not isinstance(Review, list):
        Review = [Review]
    Input_ids, Token_type_ids, Attention_mask = Tokenizer.batch_encode_plus(Review,
                                                                            padding=True,
                                                                            truncation=True,
                                                                            max_length=128,
                                                                            return_tensors='tf').values()
    prediction = Model.predict([Input_ids, Token_type_ids, Attention_mask])
    # Use argmax along the appropriate axis to get the predicted labels
    pred_labels = tf.argmax(prediction.logits, axis=1)
    # Convert the TensorFlow tensor to a NumPy array and then to a list to get the predicted sentiment labels
    pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
    return pred_labels
Let's predict with our own review
Python
Review ='''Bahubali is a blockbuster Indian movie that was released in 2015.
It is the first part of a two-part epic saga that tells the story of a legendary hero who fights for his kingdom and his love.
The movie has received rave reviews from critics and audiences alike for its stunning visuals,
spectacular action scenes, and captivating storyline.'''
Get_sentiment(Review)
Output:
1/1 [==============================] - 3s 3s/step
['positive']