nlp_(1) (1).ipynb - Colab
df.isnull().sum()
event 0
tweet_text 0
class_int 0
tweet_text_tokenize 0
tweet_stem 0
tweet_lemma 69
dtype: int64
df = df.dropna(axis=0)
df.isnull().sum()
event 0
tweet_text 0
class_int 0
tweet_text_tokenize 0
tweet_stem 0
tweet_lemma 0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 74141 entries, 0 to 74209
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   event                74141 non-null  object
 1   tweet_text           74141 non-null  object
 2   class_int            74141 non-null  object
 3   tweet_text_tokenize  74141 non-null  object
 4   tweet_stem           74141 non-null  object
 5   tweet_lemma          74141 non-null  object
dtypes: object(6)
memory usage: 4.0+ MB
df.class_int = df.class_int.astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 74141 entries, 0 to 74209
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   event                74141 non-null  object
 1   tweet_text           74141 non-null  object
 2   class_int            74141 non-null  int64
 3   tweet_text_tokenize  74141 non-null  object
 4   tweet_stem           74141 non-null  object
 5   tweet_lemma          74141 non-null  object
dtypes: int64(1), object(5)
memory usage: 4.0+ MB
df.head()
            event                                         tweet_text  class_int                                    tweet_text_tokenize                                          tweet_stem                                         tweet_lemma
1  hurricane irma  stormaileen is bringing rain in across the wes...          0  ['stormaileen', 'bringing', 'rain', 'across', 'we...  ['stormaileen', 'bring', 'rain', 'across', 'we...  ['stormaileen', 'bring', 'rain', 'across', 'we...
df.class_int.value_counts()
class_int
8 20832
6 11834
9 8479
2 8040
3 6745
5 6128
0 5296
1 3997
7 2592
4 198
Name: count, dtype: int64
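The counts are heavily skewed: class 8 has 20832 tweets while class 4 has only 198. The notebook trains without any re-weighting; if one wanted to compensate for the imbalance, a minimal sketch using scikit-learn (purely illustrative, not part of the original run):

# Optional illustration: derive 'balanced' per-class weights from the skewed counts
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.sort(df.class_int.unique())
weights = compute_class_weight('balanced', classes=classes, y=df.class_int)
class_weight = dict(zip(classes, weights))  # could be passed to model.fit(..., class_weight=class_weight)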
# Download NLTK resources
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove placeholder tokens left over from the data export
    text = re.sub(r'video|images|html', '', text, flags=re.IGNORECASE)
    # Tokenize words
    tokens = word_tokenize(text)
    # Stemming
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    # Lemmatization (applied to the stemmed tokens, as in the original cell)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]
    # Rejoin tokens into a single cleaned string
    return ' '.join(lemmatized_tokens)
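The function above is defined but never applied in the cells captured in this export (the tweet_stem and tweet_lemma columns were built in earlier, unexported cells). A hypothetical usage, where the tweet_clean column name is an assumption:

# Hypothetical application of preprocess_text; the column name is an assumption
df['tweet_clean'] = df['tweet_text'].apply(preprocess_text)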
df.head()
            event                                         tweet_text  class_int                                    tweet_text_tokenize                                          tweet_stem                                         tweet_lemma
1  hurricane irma  stormaileen is bringing rain in across the wes...          0  ['stormaileen', 'bringing', 'rain', 'across', 'we...  ['stormaileen', 'bring', 'rain', 'across', 'we...  ['stormaileen', 'bring', 'rain', 'across', 'we...
2  hurricane irma  tonight in our service we will be collecting t...          0     ['tonight', 'service', 'collecting', 'followin...  ['tonight', 'servic', 'collect', 'follow', 'it...  ['tonight', 'service', 'collect', 'follow', 'i...
3  hurricane irma  its not raining yet but we know you are gettin...          0     ['raining', 'yet', 'know', 'getting', 'ready',...  ['rain', 'yet', 'know', 'get', 'readi', 'irma'...  ['rain', 'yet', 'know', 'get', 'ready', 'irma'...
4  hurricane irma  irma update 2 already 10 people are dead in st...          0     ['irma', 'update', '2', 'already', '10', 'peop...  ['irma', 'updat', '2', 'alreadi', '10', 'peopl...  ['irma', 'update', '2', 'already', '10', 'peop...
# Split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Define labels
train_labels = train_df[['class_int_0', 'class_int_1', 'class_int_2', 'class_int_3', 'class_int_4',
                         'class_int_5', 'class_int_6', 'class_int_7', 'class_int_8', 'class_int_9']]
test_labels = test_df[['class_int_0', 'class_int_1', 'class_int_2', 'class_int_3', 'class_int_4',
                       'class_int_5', 'class_int_6', 'class_int_7', 'class_int_8', 'class_int_9']]
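The class_int_0 … class_int_9 columns are not created in any cell shown in this export; the integer label was presumably one-hot encoded earlier. A minimal sketch of that missing step (the exact call is an assumption):

# Assumed earlier cell: one-hot encode class_int into class_int_0 ... class_int_9
import pandas as pd
df = pd.concat([df, pd.get_dummies(df['class_int'], prefix='class_int').astype(int)], axis=1)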
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
All PyTorch model weights were used when initializing TFBertForSequenceClassification.
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized.
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
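These messages come from setup cells missing from the export: loading a BERT tokenizer, encoding the tweets, and converting a PyTorch checkpoint to TensorFlow. A hedged sketch of what those cells plausibly looked like; the checkpoint name is a guess from the ~4.4M-parameter summary below, and the truncation warning above suggests the original tokenizer call omitted max_length:

# Assumed setup cells (not in the export); 'prajjwal1/bert-tiny' is a guess
# based on the 4,385,920-parameter BERT layer in the summary below.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('prajjwal1/bert-tiny')

def encode(texts, max_length=64):  # max_length value is an assumption
    enc = tokenizer(list(texts), padding='max_length', truncation=True,
                    max_length=max_length, return_tensors='tf')
    return enc['input_ids'], enc['attention_mask']

train_input_ids, train_attention_mask = encode(train_df['tweet_text'])
test_input_ids, test_attention_mask = encode(test_df['tweet_text'])

# from_pt=True produces the PyTorch -> TF conversion messages above
model = TFBertForSequenceClassification.from_pretrained(
    'prajjwal1/bert-tiny', num_labels=10, from_pt=True)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])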
import time

# Start time for training
start_time = time.time()

# Training
history = model.fit(
    (train_input_ids, train_attention_mask),
    train_labels,
    validation_data=((test_input_ids, test_attention_mask), test_labels),
    epochs=10,
    batch_size=32
)

# End time for training
end_time = time.time()

# Training time
training_time = end_time - start_time
# Model summary
model.summary()
# Number of parameters
num_parameters = model.count_params()
# Testing time
start_time = time.time()
test_loss, test_accuracy = model.evaluate((test_input_ids, test_attention_mask), test_labels)
end_time = time.time()
testing_time = end_time - start_time
print("\nAdditional Information:")
print("Number of Epochs:", len(history.history['loss']))
print("Number of Parameters:", num_parameters)
print("Training Time:", training_time, "seconds")
print("Testing Time:", testing_time, "seconds")
Epoch 1/10
WARNING:tensorflow:AutoGraph could not transform <function infer_framework at 0x7dbe040bab90> and will run it as-is.
Cause: for/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function infer_framework at 0x7dbe040bab90> and will run it as-is.
Cause: for/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
1854/1854 [==============================] - 89s 32ms/step - loss: 0.9405 - accuracy: 0.6824 - val_loss: 0.7850 - val_accuracy: 0.7323
Epoch 2/10
1854/1854 [==============================] - 39s 21ms/step - loss: 0.7048 - accuracy: 0.7539 - val_loss: 0.7515 - val_accuracy: 0.7345
Epoch 3/10
1854/1854 [==============================] - 38s 21ms/step - loss: 0.5930 - accuracy: 0.7931 - val_loss: 0.7791 - val_accuracy: 0.7373
Epoch 4/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.5089 - accuracy: 0.8242 - val_loss: 0.8741 - val_accuracy: 0.7210
Epoch 5/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.4352 - accuracy: 0.8512 - val_loss: 0.8978 - val_accuracy: 0.7179
Epoch 6/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.3828 - accuracy: 0.8688 - val_loss: 0.9621 - val_accuracy: 0.7060
Epoch 7/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.3397 - accuracy: 0.8845 - val_loss: 1.0341 - val_accuracy: 0.7017
Epoch 8/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.3045 - accuracy: 0.8969 - val_loss: 1.0586 - val_accuracy: 0.7031
Epoch 9/10
1854/1854 [==============================] - 36s 20ms/step - loss: 0.2711 - accuracy: 0.9087 - val_loss: 1.1848 - val_accuracy: 0.6949
Epoch 10/10
1854/1854 [==============================] - 37s 20ms/step - loss: 0.2566 - accuracy: 0.9138 - val_loss: 1.1648 - val_accuracy: 0.6988
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 4385920
=================================================================
Total params: 4387210 (16.74 MB)
Trainable params: 4387210 (16.74 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
464/464 [==============================] - 3s 7ms/step - loss: 1.1648 - accuracy: 0.6988
Additional Information:
Number of Epochs: 10
Number of Parameters: 4387210
Training Time: 424.1439175605774 seconds
Testing Time: 3.4680261611938477 seconds
import matplotlib.pyplot as plt

train_loss = history.history['loss']
val_loss = history.history['val_loss']
train_accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
epochs = range(1, len(train_loss) + 1)
# Plot loss
plt.figure(figsize=(12, 6))
plt.plot(epochs, train_loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
# Plot accuracy
plt.figure(figsize=(12, 6))
plt.plot(epochs, train_accuracy, 'b', label='Training accuracy')
plt.plot(epochs, val_accuracy, 'r', label='Validation accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
label_names = [
'caution_and_advice',
'displaced_people_and_evacuations',
'infrastructure_and_utility_damage',
'injured_or_dead_people',
'missing_or_found_people',
'not_humanitarian',
'other_relevant_information',
'requests_or_urgent_needs',
'rescue_volunteering_or_donation_effort',
'sympathy_and_support'
]
# Compute predicted probabilities
test_probs_array = model.predict((test_input_ids, test_attention_mask))
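The argmax step that turns these outputs into the hard predictions used below is not in the export; for a transformers TF model, predict returns an output object whose logits field holds the per-class scores. A sketch (the .logits access is an assumption about the original code):

# Convert logits to hard class predictions (test_preds is used in the next cell)
import numpy as np
test_preds = np.argmax(test_probs_array.logits, axis=1)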
# Confusion Matrix
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(np.argmax(test_labels.to_numpy(), axis=1), test_preds)
Evaluation Metrics:
Training Time: 424.1439175605774 seconds
Testing Time: 3.4680261611938477 seconds
Hyperparam - Train Loss: [0.9405367970466614, 0.7047996520996094, 0.5930291414260864, 0.508854866027832, 0.43519365787506104, 0.38283029...]
Hyperparam - Validation Loss: [0.7849678993225098, 0.7515356540679932, 0.779116153717041, 0.8741132616996765, 0.8977705836296082, 0.9620...]
Evaluation - Test Loss: 1.1647629737854004
Accuracy: 0.6988333670510486
F1 Score: 0.6929945328081929
Recall: 0.6988333670510486
Precision: 0.6910538452882607
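The cell that computed these numbers is not in the export. Since Recall equals Accuracy exactly, the averaging was almost certainly 'weighted'. A sketch of the missing cell, plus a heatmap of the confusion matrix using the label_names list (the seaborn plot is an assumption, not part of the original run):

# Assumed metrics cell; 'weighted' averaging inferred from Recall == Accuracy above
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
import seaborn as sns

y_true = np.argmax(test_labels.to_numpy(), axis=1)
print("Accuracy:", accuracy_score(y_true, test_preds))
print("F1 Score:", f1_score(y_true, test_preds, average='weighted'))
print("Recall:", recall_score(y_true, test_preds, average='weighted'))
print("Precision:", precision_score(y_true, test_preds, average='weighted'))

# Visualize the confusion matrix with the human-readable class names
plt.figure(figsize=(10, 8))
sns.heatmap(conf_matrix, annot=True, fmt='d', xticklabels=label_names, yticklabels=label_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()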