
nlp_(1) (1).ipynb - Colab


Import Packages


import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, AdamWeightDecay
from sklearn.model_selection import train_test_split
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertConfig
import time
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import numpy as np

Data Preprocessing


Initially, the dataset is loaded from a CSV file, ignoring any problematic lines, and rows with missing values are dropped to maintain data
integrity. The 'class_int' column is converted to an integer type, and the 'tweet_text' entries are converted to strings. One-hot encoding is applied
to transform categorical data into a binary format. For text preprocessing, the script removes URLs, placeholders, non-word characters,
mentions, and hashtags; the text is then tokenized, stemmed, and lemmatized using NLTK's resources, which are downloaded beforehand.
These preprocessed tweets are stored in a new column, and boolean values in the dataset are replaced with numeric equivalents, finalizing the
data for subsequent analytical or predictive tasks.

# Load dataset while ignoring errors in specific lines


df = pd.read_csv("dataset_final.csv", encoding='latin1', on_bad_lines = 'skip')

df.isnull().sum()

event 0
tweet_text 0
class_int 0
tweet_text_tokenize 0
tweet_stem 0
tweet_lemma 69
dtype: int64

df = df.dropna(axis=0)

df.isnull().sum()

event 0
tweet_text 0
class_int 0
tweet_text_tokenize 0
tweet_stem 0
tweet_lemma 0
dtype: int64

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74141 entries, 0 to 74209
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 event 74141 non-null object
1 tweet_text 74141 non-null object
2 class_int 74141 non-null object
3 tweet_text_tokenize 74141 non-null object
4 tweet_stem 74141 non-null object

5 tweet_lemma 74141 non-null object
dtypes: object(6)
memory usage: 4.0+ MB

df.class_int = df.class_int.astype(int)

# Convert all values to string type


df['tweet_text'] = df['tweet_text'].apply(str)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74141 entries, 0 to 74209
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 event 74141 non-null object
1 tweet_text 74141 non-null object
2 class_int 74141 non-null int64
3 tweet_text_tokenize 74141 non-null object
4 tweet_stem 74141 non-null object
5 tweet_lemma 74141 non-null object
dtypes: int64(1), object(5)
memory usage: 4.0+ MB

df.head()

   event           tweet_text                                          class_int  tweet_text_tokenize                                 tweet_stem                                      tweet_lemma
0  hurricane irma  armed with a chainsaw and a charitable spirit ...   6          ['armed', 'chainsaw', 'charitable', 'spirit', ...   ['arm', 'chainsaw', 'charit', 'spirit', ...     ['arm', 'chainsaw', 'charitable', 'spirit', ...
1  hurricane irma  stormaileen is bringing rain in across the wes...   0          ['stormaileen', 'bringing', 'rain', 'across', ...   ['stormaileen', 'bring', 'rain', 'across', ...  ['stormaileen', 'bring', 'rain', 'across', ...

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74141 entries, 0 to 74209
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 event 74141 non-null object
1 tweet_text 74141 non-null object
2 class_int 74141 non-null int64
3 tweet_text_tokenize 74141 non-null object
4 tweet_stem 74141 non-null object
5 tweet_lemma 74141 non-null object
dtypes: int64(1), object(5)
memory usage: 4.0+ MB

df.class_int.value_counts()

class_int
8 20832
6 11834
9 8479
2 8040
3 6745
5 6128
0 5296
1 3997
7 2592
4 198
Name: count, dtype: int64

# Perform one-hot encoding


df = pd.get_dummies(df, columns=['class_int'])
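For reference, the get_dummies call above expands class_int into ten indicator columns (class_int_0 through class_int_9). A toy sketch of the idea with made-up values (the toy frame is not part of the notebook):

# Toy illustration of the one-hot expansion (values are made up)
toy = pd.DataFrame({'class_int': [6, 0, 8]})
print(pd.get_dummies(toy, columns=['class_int']))
# Produces indicator columns class_int_0, class_int_6, class_int_8; recent pandas
# versions fill them with True/False, which is why the notebook later maps
# booleans back to 1/0.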

# Download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')

# Initialize Porter stemmer and WordNet lemmatizer


stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove placeholders
    text = re.sub(r'video|images|html', '', text, flags=re.IGNORECASE)

    # Remove non-word characters
    text = re.sub(r'\W+', ' ', text)

    # Remove mentions and hashtags
    text = re.sub(r'@[^\s]+|#\S+', '', text)
    # note: the '\W+' substitution above has already replaced '@' and '#'
    # with spaces, so this pattern has little left to match at this point

    # Tokenize words
    tokens = word_tokenize(text)

    # Stemming
    stemmed_tokens = [stemmer.stem(token) for token in tokens]

    # Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]

    # Join tokens back into a string
    preprocessed_text = ' '.join(lemmatized_tokens)

    return preprocessed_text

# Apply preprocessing to the tweet_text column


df['preprocessed_tweet'] = df['tweet_text'].apply(preprocess_text)

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!

# Replace True with 1 and False with 0


df = df.replace({True: 1, False: 0})

df.head()


   event           tweet_text                                          tweet_text_tokenize                                 tweet_stem                                      tweet_lemma                                      class_int_0  ...
0  hurricane irma  armed with a chainsaw and a charitable spirit ...   ['armed', 'chainsaw', 'charitable', 'spirit', ...   ['arm', 'chainsaw', 'charit', 'spirit', ...     ['arm', 'chainsaw', 'charitable', 'spirit', ...  0  ...
1  hurricane irma  stormaileen is bringing rain in across the wes...   ['stormaileen', 'bringing', 'rain', 'across', ...   ['stormaileen', 'bring', 'rain', 'across', ...  ['stormaileen', 'bring', 'rain', 'across', ...   1  ...
2  hurricane irma  tonight in our service we will be collecting t...   ['tonight', 'service', 'collecting', 'followin...   ['tonight', 'servic', 'collect', 'follow', ...  ['tonight', 'service', 'collect', 'follow', ...  0  ...
3  hurricane irma  its not raining yet but we know you are gettin...   ['raining', 'yet', 'know', 'getting', 'ready',...   ['rain', 'yet', 'know', 'get', 'readi', ...     ['rain', 'yet', 'know', 'get', 'ready', ...      0  ...
4  hurricane irma  irma update 2 already 10 people are dead in st...   ['irma', 'update', '2', 'already', '10', 'peop...   ['irma', 'updat', '2', 'alreadi', '10', ...     ['irma', 'update', '2', 'already', '10', ...     0  ...

Model Loading and Test Split


This code processes a dataset of tweets for machine learning by splitting it into training and testing sets with a 20% test size for model
evaluation. It utilizes a BERT tokenizer to prepare the data for input into a TinyBERT model, adapting the model's final layer to output ten classes
for multi-class classification. The model is compiled using the Adam optimizer and categorical cross-entropy loss. Finally, both the input
encodings and labels are converted into TensorFlow tensors, setting the stage for model training and evaluation.

# Split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Initialize the tokenizer


tokenizer = AutoTokenizer.from_pretrained("google/bert_uncased_L-2_H-128_A-2")

# Tokenize the text


train_encodings = tokenizer(train_df['preprocessed_tweet'].tolist(), padding=True, truncation=True, return_tensors='tf')
test_encodings = tokenizer(test_df['preprocessed_tweet'].tolist(), padding=True, truncation=True, return_tensors='tf')

# Define labels
train_labels = train_df[['class_int_0', 'class_int_1', 'class_int_2', 'class_int_3', 'class_int_4',
'class_int_5', 'class_int_6', 'class_int_7', 'class_int_8', 'class_int_9']]
test_labels = test_df[['class_int_0', 'class_int_1', 'class_int_2', 'class_int_3', 'class_int_4',
'class_int_5', 'class_int_6', 'class_int_7', 'class_int_8', 'class_int_9']]

# Ensure labels are in numeric format


train_labels = train_labels.astype(int)
test_labels = test_labels.astype(int)

# Define number of output classes


num_classes = 10

# Load TinyBERT model


model = TFBertForSequenceClassification.from_pretrained("google/bert_uncased_L-2_H-128_A-2")

# Define number of output classes


num_classes = 10

# Modify the last layer for your specific task


config = BertConfig.from_pretrained("google/bert_uncased_L-2_H-128_A-2", num_labels=num_classes)
model.bert.config = config
model.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')

# Compile the model


model.compile(optimizer= 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])

# Convert train_encodings and test_encodings to TensorFlow tensors


train_input_ids = tf.convert_to_tensor(train_encodings['input_ids'])
train_attention_mask = tf.convert_to_tensor(train_encodings['attention_mask'])
test_input_ids = tf.convert_to_tensor(test_encodings['input_ids'])
test_attention_mask = tf.convert_to_tensor(test_encodings['attention_mask'])

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized.
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
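One possible way to address two of the warnings above, shown here only as a hedged alternative rather than what the notebook does: the classification head can be sized for ten labels at load time (so it does not need to be swapped afterwards), and the tokenizer can be given an explicit max_length so truncation actually applies. The max_length value of 128 is illustrative, and with the stock head the loss should be computed from logits:

# Alternative sketch (not the notebook's cell): size the head at load time
model = TFBertForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-2_H-128_A-2",
    num_labels=num_classes               # 10-way classification head, randomly initialized
)

# Tokenize with an explicit cap so truncation takes effect (128 is illustrative)
train_encodings = tokenizer(train_df['preprocessed_tweet'].tolist(),
                            padding=True, truncation=True, max_length=128,
                            return_tensors='tf')

# The stock head returns logits, so compute the loss from logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])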

Model Training


This code executes the training and evaluation of a TinyBERT model, timing the entire training process using Python's time.time() to capture
start and end times. Training occurs over 10 epochs with a batch size of 32, using both training and validation datasets. After training, the script
prints a model summary and calculates the total number of trainable parameters. It then evaluates the model on the test set to measure loss
and accuracy, also timing this process for performance analysis. The output includes training and testing times, the number of epochs, and the
total trainable parameters, offering a comprehensive view of the model's training performance and efficiency.

# Start time for training
start_time = time.time()

# Training
history = model.fit(
    (train_input_ids, train_attention_mask),
    train_labels,
    validation_data=((test_input_ids, test_attention_mask), test_labels),
    epochs=10,
    batch_size=32
)

# End time for training


end_time = time.time()

# Training time
training_time = end_time - start_time

# Model summary
model.summary()

# Number of parameters
num_parameters = model.count_params()

# Testing time
start_time = time.time()
test_loss, test_accuracy = model.evaluate((test_input_ids, test_attention_mask), test_labels)
end_time = time.time()
testing_time = end_time - start_time

print("\nAdditional Information:")
print("Number of Epochs:", len(history.history['loss']))
print("Number of Parameters:", num_parameters)
print("Training Time:", training_time, "seconds")
print("Testing Time:", testing_time, "seconds")

Epoch 1/10
WARNING:tensorflow:AutoGraph could not transform <function infer_framework at 0x7dbe040bab90> and will run it as-is.
Cause: for/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function infer_framework at 0x7dbe040bab90> and will run it as-is.
Cause: for/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
1854/1854 [==============================] - 89s 32ms/step - loss: 0.9405 - accuracy: 0.6824 - val_loss: 0.7850 - val_accuracy: 0.7323
Epoch 2/10
1854/1854 [==============================] - 39s 21ms/step - loss: 0.7048 - accuracy: 0.7539 - val_loss: 0.7515 - val_accuracy: 0.7345
Epoch 3/10
1854/1854 [==============================] - 38s 21ms/step - loss: 0.5930 - accuracy: 0.7931 - val_loss: 0.7791 - val_accuracy: 0.7373
Epoch 4/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.5089 - accuracy: 0.8242 - val_loss: 0.8741 - val_accuracy: 0.7210
Epoch 5/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.4352 - accuracy: 0.8512 - val_loss: 0.8978 - val_accuracy: 0.7179
Epoch 6/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.3828 - accuracy: 0.8688 - val_loss: 0.9621 - val_accuracy: 0.7060
Epoch 7/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.3397 - accuracy: 0.8845 - val_loss: 1.0341 - val_accuracy: 0.7017
Epoch 8/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.3045 - accuracy: 0.8969 - val_loss: 1.0586 - val_accuracy: 0.7031
Epoch 9/10
1854/1854 [==============================] - 36s 20ms/step - loss: 0.2711 - accuracy: 0.9087 - val_loss: 1.1848 - val_accuracy: 0.6949
Epoch 10/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.2566 - accuracy: 0.9138 - val_loss: 1.1648 - val_accuracy: 0.6988
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 4385920

dropout_7 (Dropout) multiple 0

=================================================================
Total params: 4387210 (16.74 MB)
Trainable params: 4387210 (16.74 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
464/464 [==============================] - 3s 7ms/step - loss: 1.1648 - accuracy: 0.6988

Additional Information:
Number of Epochs: 10
Number of Parameters: 4387210
Training Time: 424.1439175605774 seconds

Testing Time: 3.4680261611938477 seconds
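Note that in the log above the validation loss is lowest at epoch 2 and rises afterwards while training accuracy keeps climbing, which points to overfitting. A minimal sketch of adding early stopping to the same fit call, assuming the tensors and labels defined earlier (the patience value is illustrative):

# Sketch: stop when validation loss stops improving and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=2,                   # illustrative: wait 2 epochs without improvement
    restore_best_weights=True
)

history = model.fit(
    (train_input_ids, train_attention_mask),
    train_labels,
    validation_data=((test_input_ids, test_attention_mask), test_labels),
    epochs=10,
    batch_size=32,
    callbacks=[early_stop]
)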

Model Evaluation


The code below demonstrates comprehensive evaluation and visualization of the trained TinyBERT model on the text classification task. It first plots training and validation loss and accuracy over the epochs, giving visual insight into the model's learning and generalization across training. It then computes predicted probabilities by applying a softmax function to the model's logits. The script calculates the ROC curve and the Area Under the Curve (AUC) for each class, showing model performance across different decision thresholds. The predicted probabilities are further used to derive class predictions, which are compared against the true labels to compute accuracy, F1 score, recall, and precision. A confusion matrix is plotted to visually assess how well the model distinguishes between classes. Finally, a detailed classification report provides precision, recall, F1-score, and support for each class, summarizing performance across metrics. Together, these evaluations and visualizations highlight the model's strengths and weaknesses in classifying text into specific categories.

train_loss = history.history['loss']
val_loss = history.history['val_loss']
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
epochs = range(1, len(train_loss) + 1)

# Plot loss
plt.figure(figsize=(12, 6))
plt.plot(epochs, train_loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plot accuracy
plt.figure(figsize=(12, 6))
plt.plot(epochs, train_accuracy, 'b', label='Training accuracy')
plt.plot(epochs, val_accuracy, 'r', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()


label_names = [
'caution_and_advice',
'displaced_people_and_evacuations',
'infrastructure_and_utility_damage',
'injured_or_dead_people',
'missing_or_found_people',
'not_humanitarian',
'other_relevant_information',
'requests_or_urgent_needs',
'rescue_volunteering_or_donation_effort',
'sympathy_and_support'
]

# Compute predicted probabilities
test_probs_array = model.predict((test_input_ids, test_attention_mask))

# Ensure test_labels is a numpy array


test_labels_array = test_labels.to_numpy()

# Compute predicted probabilities


test_probs_array = model.predict((test_input_ids, test_attention_mask))

# Get logits from TFSequenceClassifierOutput


logits = test_probs_array.logits

# Apply softmax to get probabilities


test_probs_array = tf.nn.softmax(logits, axis=-1)

# Convert probabilities to numpy array


test_probs_array = test_probs_array.numpy()

# Compute ROC curve and ROC area for each class


fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(num_classes):
    fpr[i], tpr[i], _ = roc_curve(test_labels_array[:, i], test_probs_array[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot AUROC for each class


plt.figure(figsize=(10, 8))
for i in range(num_classes):
    plt.plot(fpr[i], tpr[i], label='Class {} (AUC = {:.2f})'.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--')


plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()


464/464 [==============================] - 3s 7ms/step


test_probs = model.predict((test_input_ids, test_attention_mask))

464/464 [==============================] - 3s 7ms/step

# Extract probabilities from the logits


test_probs_array = tf.nn.softmax(test_probs.logits, axis=-1)

# Convert probabilities to numpy array


test_probs_numpy = test_probs_array.numpy()

# Calculate predicted labels


test_preds = np.argmax(test_probs_numpy, axis=-1)

464/464 [==============================] - 3s 7ms/step

# Calculate hyperparam - Train/Validation Loss


train_loss = history.history['loss']
val_loss = history.history['val_loss']

accuracy = accuracy_score(np.argmax(test_labels.to_numpy(), axis=1), test_preds)


f1 = f1_score(np.argmax(test_labels.to_numpy(), axis=1), test_preds, average='weighted')
recall = recall_score(np.argmax(test_labels.to_numpy(), axis=1), test_preds, average='weighted')
precision = precision_score(np.argmax(test_labels.to_numpy(), axis=1), test_preds, average='weighted')

# Confusion Matrix
conf_matrix = confusion_matrix(np.argmax(test_labels.to_numpy(), axis=1), test_preds)

# Print evaluation metrics


print("\nEvaluation Metrics:")
print("Training Time:", training_time, "seconds")
print("Testing Time:", testing_time, "seconds")
print("Hyperparam - Train Loss:", train_loss)
print("Hyperparam - Validation Loss:", val_loss)
print("Evaluation - Test Loss:", test_loss)
print("Accuracy:", accuracy)
print("F1 Score:", f1)
print("Recall:", recall)
print("Precision:", precision)

Evaluation Metrics:
Training Time: 424.1439175605774 seconds
Testing Time: 3.4680261611938477 seconds
Hyperparam - Train Loss: [0.9405367970466614, 0.7047996520996094, 0.5930291414260864, 0.508854866027832, 0.43519365787506104, 0.38283029
Hyperparam - Validation Loss: [0.7849678993225098, 0.7515356540679932, 0.779116153717041, 0.8741132616996765, 0.8977705836296082, 0.9620
Evaluation - Test Loss: 1.1647629737854004
Accuracy: 0.6988333670510486
F1 Score: 0.6929945328081929
Recall: 0.6988333670510486
Precision: 0.6910538452882607
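
Because the class distribution shown earlier is heavily skewed (class 4 has only 198 tweets), the weighted averages above can hide weak performance on rare classes. A small sketch of computing macro-averaged scores alongside them, reusing the same arrays (the variable names are illustrative):

# Sketch: macro averaging weights every class equally, exposing rare-class errors
y_true = np.argmax(test_labels.to_numpy(), axis=1)
print("Macro F1:", f1_score(y_true, test_preds, average='macro'))
print("Macro Recall:", recall_score(y_true, test_preds, average='macro'))
print("Macro Precision:", precision_score(y_true, test_preds, average='macro'))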

# Calculate confusion matrix


conf_matrix = confusion_matrix(np.argmax(test_labels.to_numpy(), axis=1), test_preds)

# Plot confusion matrix


plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='d', xticklabels=label_names, yticklabels=label_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
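
The description above also mentions a detailed per-class classification report, but that cell is cut off in this export. A minimal sketch of how it could be produced with the already-imported classification_report, assuming label_names is ordered to match the class_int indices (as the heatmap above already assumes):

# Sketch: per-class precision, recall, F1 and support
print(classification_report(
    np.argmax(test_labels.to_numpy(), axis=1),   # true class indices
    test_preds,                                  # predicted class indices
    target_names=label_names
))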
