Exp 10 Sentiment Analysis BERT
[]:
pip install transformers
[]:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: transformers in /usr/local/lib/python3.9/dist-packages (4.28.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /usr/local/lib/python3.9/dist-packages (from transformers) (0.13.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from transformers) (23.1)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.9/dist-packages (from transformers) (0.13.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.9/dist-packages (from transformers) (3.11.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (2022.10.31)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (1.22.4)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.9/dist-packages (from transformers) (6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from transformers) (2.27.1)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.9/dist-packages (from transformers) (4.65.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (4.5.0)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (1.26.15)
We will now load the pre-trained BERT tokenizer and sequence classifier, as well as InputExample and InputFeatures.
Then we will build our model with the sequence classifier and our tokenizer with BERT's pre-trained tokenizer.
[]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
[]:
All model checkpoint layers were used when initializing TFBertForSequenceClassification.
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[]:
model.summary()
[]:
Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
bert (TFBertMainLayer)       multiple                  109482240

dropout (Dropout)            multiple                  0

classifier (Dense)           multiple                  1538

=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
Here are the results. We have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for the classification task:
Now that we have our model, let’s create our input sequences from the IMDB reviews dataset:
IMDB Dataset
The IMDB Reviews dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB.
The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative.
It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning.
Initial Imports
We will first have two imports: TensorFlow and Pandas.
[]:
import tensorflow as tf
import pandas as pd
Get the Data from the Stanford Repo
Then, we can download the dataset from Stanford's relevant directory with the tf.keras.utils.get_file function, as shown below:
[]:
URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz",
                                  origin=URL,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')
Remove Unlabeled Reviews
To remove the unlabeled reviews, we need the following operations.
[]:
# The shutil module offers a number of high-level
# operations on files and collections of files.
import os
import shutil
# Create main directory path ("/aclImdb")
main_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
# Create sub directory path ("/aclImdb/train")
train_dir = os.path.join(main_dir, 'train')
# Remove unsup folder since this is a supervised learning task
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
# View the final train folder
print(os.listdir(train_dir))
[]:
['unsupBow.feat', 'urls_unsup.txt', 'urls_neg.txt', 'neg', 'labeledBow.feat', 'pos', 'urls_pos.txt']
Now that we have our data cleaned and prepared, we can create the datasets with text_dataset_from_directory using the following lines.
I want to process the entire data in a single batch, which is why I selected a very large batch size:
[]:
# We create a training dataset and a validation
# dataset from our "aclImdb/train" directory with an 80/20 split.
train = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=30000, validation_split=0.2,
    subset='training', seed=123)
test = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=30000, validation_split=0.2,
    subset='validation', seed=123)
[]:
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Now that we have our basic train and test datasets, I want to prepare them for our BERT model.
To make it more comprehensible, I will create a pandas DataFrame from each TensorFlow Dataset object.
The following code converts our train Dataset object into a train pandas DataFrame:
[]:
for i in train.take(1):
    train_feat = i[0].numpy()
    train_lab = i[1].numpy()
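# The Dataset yields raw byte strings and integer labels; build the pandas
# DataFrame that the text above refers to. This construction is not shown in
# the original cell, so treat it as an assumption; the column names match the
# DATA_COLUMN / LABEL_COLUMN constants used further below.
train = pd.DataFrame([train_feat, train_lab]).T
train.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
train['DATA_COLUMN'] = train['DATA_COLUMN'].str.decode('utf-8')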
I will do the same operations for the test dataset with the following lines:
[]:
for j in test.take(1):
    test_feat = j[0].numpy()
    test_lab = j[1].numpy()
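# Same assumed DataFrame construction for the test split.
test = pd.DataFrame([test_feat, test_lab]).T
test.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
test['DATA_COLUMN'] = test['DATA_COLUMN'].str.decode('utf-8')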
Creating Input Sequences
We have two pandas DataFrame objects waiting for us to convert them into objects suitable for the BERT model.
We will take advantage of the InputExample class, which helps us create sequences from our dataset.
[]:
InputExample(guid=None,
             text_a="Hello, world",
             text_b=None,
             label=1)
[]:
InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)
1 — convert_data_to_examples: this accepts our train and test DataFrames and converts each row into an InputExample object.
2 — convert_examples_to_tf_dataset: this function tokenizes the InputExample objects, creates the required input format from the tokenized objects, and finally builds an input dataset that we can feed to the model.
[]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN):
    train_InputExamples = train.apply(lambda x: InputExample(guid=None,  # Globally unique ID for bookkeeping, unused in this case
                                                             text_a=x[DATA_COLUMN],
                                                             text_b=None,
                                                             label=x[LABEL_COLUMN]), axis=1)
    validation_InputExamples = test.apply(lambda x: InputExample(guid=None,
                                                                 text_a=x[DATA_COLUMN],
                                                                 text_b=None,
                                                                 label=x[LABEL_COLUMN]), axis=1)
    return train_InputExamples, validation_InputExamples

# Signature reconstructed from the function body; the max_length default of 128
# matches the max_length used in the prediction cell below.
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = []  # holds the InputFeatures that are turned into a tf.data.Dataset below

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length,  # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True,  # pads to the right by default (deprecated; see the warning below)
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
                                                     input_dict["token_type_ids"],
                                                     input_dict["attention_mask"])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )

DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'
We can call the functions we created above with the following lines:
[]:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)
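# The deprecation warning below comes from encode_plus, so this cell must also
# have converted the InputExamples into tf.data datasets before training. The
# exact call is not shown above, so the shuffle buffer and batch sizes here are
# assumptions based on the usual workflow around convert_examples_to_tf_dataset.
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)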
[]:
/usr/local/lib/python3.9/dist-packages/transformers/tokenization_utils_base.py:2354: FutureWarning: The `pad_to_max_length` argument
is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in
the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g.
`max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).
warnings.warn(
Our datasets containing the processed input sequences are ready to be fed to the model.
[]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])
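# The training call is missing from this cell; the "Epoch 1/2" log below implies
# the model was fit for two epochs on the datasets built above. Treat this line
# as a reconstruction rather than the original code.
model.fit(train_data, epochs=2, validation_data=validation_data)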
[]:
Epoch 1/2
611/Unknown - 590s 874ms/step - loss: 0.3459 - accuracy: 0.8454
Making Predictions
I created a list of 10 reviews; a few are clearly positive and a few are clearly negative.
[]:
pred_sentences = [
    'This was an awesome movie. I watch it twice my time watching this beautiful movie if I have known it was this good',
    'One of the worst movies of all time. I cannot believe I wasted two hours of my life for this movie',
    'Avatar The Way Of Water movie review: Avatar 2 is just stunning in the parts it skims along the water, dives deep',
    'After 11 years, the Jackass crew is back for another crusade.',
    'the movie is not so good',
    'i liked the movie but i dont recommend it to anyone',
    'its just one time watch',
    'the movie was very lengthy not recommended',
    'the movie was horrible and disgusting',
    'S. S. Rajamoulis magnum opus epic war saga set new high for Indian Cinema across globe. Action sequences war scenes'
]
len(pred_sentences)
We need to tokenize our reviews with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model and run a final softmax layer to get the predictions.
We can then use the argmax function to determine whether the sentiment prediction for each review is positive or negative. Finally, we will print out the results with a simple for loop.
The following lines perform all of these operations:
[]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": \n", labels[label[i]])