BERT.ipynb - Colaboratory

This document summarizes the steps taken to build and train a BERT model for text classification on an SMS spam detection dataset. The key steps are: 1. Preprocessing the SMS spam dataset and splitting it into train, validation, and test sets. 2. Loading the pre-trained BERT model and defining a classification architecture with additional layers. 3. Tokenizing and encoding the train, validation, and test texts with the BERT tokenizer. 4. Defining the model, loss function, optimizer, and training loop to fine-tune BERT on the SMS spam detection task for 20 epochs.


Subrata Jana, Week 10 Assignment: BERT Model

!pip install transformers


import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast

# specify GPU
#device = torch.device("cuda")

from google.colab import files

uploaded = files.upload()

Saving spam.csv to spam.csv

df = pd.read_csv("spam.csv", encoding = 'latin-1')


df.head()

     v1                                                  v2  Unnamed: 2  Unnamed: 3  Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...         NaN         NaN         NaN
1   ham                      Ok lar... Joking wif u oni...          NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...         NaN         NaN         NaN
3   ham  U dun say so early hor... U c already then say...         NaN         NaN         NaN

# drop the mostly-empty 'Unnamed' columns (any column that contains NaN values)
df.dropna(how="any", inplace=True, axis=1)


df.columns = ['label', 'message']
df.head()

label message

0 ham Go until jurong point, crazy.. Available only ...

1 ham Ok lar... Joking wif u oni...

2 spam Free entry in 2 a wkly comp to win FA Cup fina...

3 ham U dun say so early hor... U c already then say...

4 ham Nah I don't think he goes to usf, he lives aro...

# encode the labels as integers: ham -> 0, spam -> 1
df['label_num'] = df.label.map({'ham':0, 'spam':1})
df.head()

label message label_num

0 ham Go until jurong point, crazy.. Available only ... 0

1 ham Ok lar... Joking wif u oni... 0

2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1

3 ham U dun say so early hor... U c already then say... 0

4 ham Nah I don't think he goes to usf, he lives aro... 0

# check class distribution


df['label_num'].value_counts(normalize = True)

0 0.865937
1 0.134063
Name: label_num, dtype: float64

# split the dataset into train, validation and test sets (roughly 70% / 15% / 15%)
train_text, temp_text, train_labels, temp_labels = train_test_split(df['message'], df['label_num'],
                                                                    random_state=2018,
                                                                    test_size=0.3,
                                                                    stratify=df['label_num'])

val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels,
                                                                random_state=2018,
                                                                test_size=0.5,
                                                                stratify=temp_labels)
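As a quick sanity check (added here, not part of the original notebook), the split sizes can be printed; test_size=0.3 followed by a 50/50 split of the remainder yields the 70/15/15 proportions mentioned above.

# sanity check (added): number of messages in each split
print(len(train_text), len(val_text), len(test_text))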

# import BERT-base pretrained model


bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer


tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

config.json: 100% 570/570 [00:00<00:00, 12.4kB/s]
model.safetensors: 100% 440M/440M [00:03<00:00, 113MB/s]
tokenizer_config.json: 100% 28.0/28.0 [00:00<00:00, 585B/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 3.66MB/s]

# get length of all the messages in the train set


seq_len = [len(i.split()) for i in train_text]

pd.Series(seq_len).hist(bins = 30)

<Axes: >
[Output: histogram of per-message word counts in the training set]
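A supplementary look at the same distribution (added here, not in the original notebook) helps motivate the max_length of 25 used when padding and truncating in the next cells.

# supplementary check: summary statistics and upper quantiles of message length (in words)
print(pd.Series(seq_len).describe())
print(pd.Series(seq_len).quantile([0.90, 0.95, 0.99]))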

# tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(
train_text.tolist(),
max_length = 25,
pad_to_max_length=True,
truncation=True
)

# tokenize and encode sequences in the validation set


tokens_val = tokenizer.batch_encode_plus(
val_text.tolist(),
max_length = 25,
pad_to_max_length=True,
truncation=True
)

# tokenize and encode sequences in the test set


tokens_test = tokenizer.batch_encode_plus(
test_text.tolist(),
max_length = 25,
pad_to_max_length=True,
truncation=True
)
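To see what batch_encode_plus produces, a single encoded example can be inspected; this is an illustrative check added here, not part of the original notebook. Each entry holds input_ids padded or truncated to 25 tokens and a matching attention_mask of 1s (real tokens) and 0s (padding).

# illustrative check (added): inspect one encoded training example
print(train_text.tolist()[0])
print(tokens_train['input_ids'][0])       # 25 token ids, including [CLS], [SEP] and [PAD]
print(tokens_train['attention_mask'][0])  # 1 for real tokens, 0 for padding
print(tokenizer.decode(tokens_train['input_ids'][0]))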

train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

#define a batch size


batch_size = 32

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training


train_sampler = RandomSampler(train_data)

# dataLoader for train set


train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)

# sampler for sampling the data during validation
val_sampler = SequentialSampler(val_data)

# dataLoader for validation set


val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)

# freeze all the BERT parameters so only the new classification head is trained
for param in bert.parameters():
    param.requires_grad = False
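To confirm the freeze (a check added here, not in the source notebook), the frozen parameter count can be printed; every BERT weight should now be excluded from gradient updates.

# added check: count frozen vs. trainable parameters inside the BERT backbone
frozen = sum(p.numel() for p in bert.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in bert.parameters() if p.requires_grad)
print(f'frozen BERT parameters: {frozen:,}, trainable BERT parameters: {trainable:,}')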

class BERT_Arch(nn.Module):

    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert

        # dropout layer
        self.dropout = nn.Dropout(0.1)

        # relu activation function
        self.relu = nn.ReLU()

        # dense layer 1
        self.fc1 = nn.Linear(768, 512)

        # dense layer 2 (output layer)
        self.fc2 = nn.Linear(512, 2)

        # softmax activation function
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):

        # pass the inputs to BERT; cls_hs is the pooled [CLS] representation
        _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)

        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)

        # output layer
        x = self.fc2(x)

        # apply softmax activation
        x = self.softmax(x)

        return x

# pass the pre-trained BERT to our defined architecture
model = BERT_Arch(bert)
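A small forward-pass check (added here as an illustration, not in the original notebook) confirms that the architecture returns one row of two log-probabilities per input message.

# illustration (added): run two training examples through the untrained head
with torch.no_grad():
    sample_out = model(train_seq[:2], train_mask[:2])
print(sample_out.shape)  # torch.Size([2, 2]) -- log-probabilities for ham/spam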

# optimizer from hugging face transformers
# (note: transformers.AdamW is deprecated in newer versions; torch.optim.AdamW is the drop-in replacement)
from transformers import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

from sklearn.utils.class_weight import compute_class_weight

y = train_labels
classes=np.unique(y)

#compute the class weights


class_weights = compute_class_weight('balanced', classes = classes, y = y)

print('Class Weights:',class_weights)

/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated ...
  warnings.warn(
Class Weights: [0.57743559 3.72848948]
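These values follow from sklearn's 'balanced' heuristic, weight_c = n_samples / (n_classes * n_samples_c), i.e. 1 / (2 * class_fraction) here: 1/(2*0.866) is about 0.577 for ham and 1/(2*0.134) is about 3.73 for spam. The one-line check below (added for illustration) reproduces the printed weights.

# added check: 'balanced' class weights = n_samples / (n_classes * count per class)
print(len(y) / (2 * np.bincount(y)))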

# converting list of class weights to a tensor
weights= torch.tensor(class_weights,dtype=torch.float)

# push to GPU
#weights = weights.to(device)

# define the loss function


cross_entropy = nn.NLLLoss(weight=weights)
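As a side note, pairing the model's nn.LogSoftmax output with nn.NLLLoss is mathematically equivalent to applying nn.CrossEntropyLoss(weight=weights) to raw logits; the small check below (illustrative, not from the notebook) confirms this on random data.

# illustrative check (added): weighted CrossEntropyLoss on logits == weighted NLLLoss on log-probabilities
logits = torch.randn(4, 2)
demo_labels = torch.tensor([0, 1, 1, 0])
ce = nn.CrossEntropyLoss(weight=weights)(logits, demo_labels)
nll = nn.NLLLoss(weight=weights)(nn.LogSoftmax(dim=1)(logits), demo_labels)
print(torch.allclose(ce, nll))  # True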

# number of training epochs


epochs = 20

# function to train the model
def train():

    model.train()
    total_loss, total_accuracy = 0, 0

    # empty list to save model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(train_dataloader):

        # progress update after every 50 batches
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

        # push the batch to gpu
        #batch = [r.to(device) for r in batch]

        sent_id, mask, labels = batch

        # clear previously calculated gradients
        model.zero_grad()

        # get model predictions for the current batch
        preds = model(sent_id, mask)

        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)

        # add on to the total loss
        total_loss = total_loss + loss.item()

        # backward pass to calculate the gradients
        loss.backward()

        # clip the gradients to 1.0; this helps prevent the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # model predictions are stored on GPU, so push them to CPU
        preds = preds.detach().cpu().numpy()

        # append the model predictions
        total_preds.append(preds)

    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # predictions are in the form of (no. of batches, size of batch, no. of classes);
    # reshape the predictions into (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    # return the loss and predictions
    return avg_loss, total_preds

# function for evaluating the model
def evaluate():

    print("\nEvaluating...")

    # deactivate dropout layers
    model.eval()

    total_loss, total_accuracy = 0, 0

    # empty list to save the model predictions
    total_preds = []

    # iterate over batches
    for step, batch in enumerate(val_dataloader):

        # progress update every 50 batches
        if step % 50 == 0 and not step == 0:

            # calculate elapsed time in minutes
            # (format_time and t0 are not defined in this notebook, so the call is left commented out)
            #elapsed = format_time(time.time() - t0)

            # report progress
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))

        # push the batch to gpu
        #batch = [t.to(device) for t in batch]

        sent_id, mask, labels = batch

        # deactivate autograd
        with torch.no_grad():

            # model predictions
            preds = model(sent_id, mask)

            # compute the validation loss between actual and predicted values
            loss = cross_entropy(preds, labels)

            total_loss = total_loss + loss.item()

            preds = preds.detach().cpu().numpy()

            total_preds.append(preds)

    # compute the validation loss of the epoch
    avg_loss = total_loss / len(val_dataloader)

    # reshape the predictions into (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    return avg_loss, total_preds

# set initial loss to infinite
best_valid_loss = float('inf')

# defining epochs
epochs = 20

# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []

# for each epoch
for epoch in range(epochs):

    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    # train model
    train_loss, _ = train()

    # evaluate model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

#load weights of best model


path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))
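The captured transcript ends after reloading the best weights. Below is a hedged sketch (not part of the original notebook output) of how the held-out test set could be scored with the classification_report imported at the top.

# sketch (added): get predictions for the test set and report precision/recall/F1
model.eval()
with torch.no_grad():
    test_preds = model(test_seq, test_mask)
    test_preds = test_preds.detach().cpu().numpy()

pred_labels = np.argmax(test_preds, axis=1)
print(classification_report(test_y.numpy(), pred_labels))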
