Data Science Project
import numpy as np
import pandas as pd
import spacy
from tqdm import tqdm
import re
from itertools import chain
from collections import Counter
from string import punctuation
import time
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
%matplotlib inline
wordcloud_mask = np.array(Image.open("/kaggle/input/wordcloud-mask-collection/twitter.png"))
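The cells that read the train and test CSVs are not shown in this extract; a minimal sketch, with placeholder paths that must be adjusted to the actual Kaggle dataset location:
# Hypothetical paths: point these at the actual competition files
df = pd.read_csv("/kaggle/input/train.csv")        # columns: id, label, tweet
test_data = pd.read_csv("/kaggle/input/test.csv")  # columns: id, tweet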
df.head()
id label tweet
0 1 0 #fingerprint #Pregnancy Test https://goo.gl/h1...
1 2 0 Finally a transparant silicon case ^^ Thanks t...
2 3 0 We love this! Would you go? #talk #makememorie...
3 4 0 I'm wired I know I'm George I was made that wa...
4 5 1 What amazing service! Apple won't even talk to...
df.shape
(7920, 3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7920 entries, 0 to 7919
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 7920 non-null int64
1 label 7920 non-null int64
2 tweet 7920 non-null object
dtypes: int64(2), object(1)
memory usage: 185.8+ KB
df.isnull().sum()
id 0
label 0
tweet 0
dtype: int64
Remove Numbers
df['clean_tweet'] = df['clean_tweet'].str.replace("[0-9]", " ", regex=True)
Remove Whitespaces
df['clean_tweet'] = df['clean_tweet'].apply(lambda x: ' '.join(x.split()))
Lemmatize (Normalize) the Text
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def lemmatization(texts):
    # Replace each token with its lemma so inflected forms map to one word
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output
df['clean_tweet'] = lemmatization(df['clean_tweet'])
df.drop(columns=["id","tweet"],axis=1,inplace=True)
df.head()
label clean_tweet
0 0 fingerprint pregnancy test android app beautif...
1 0 finally a transparant silicon case thank to my...
2 0 we love this would you go talk makememorie unp...
3 0 I be wire I know I be george I be make that wa...
4 1 what amazing service apple will not even talk ...
test_data.head()
id tweet
0 7921 I hate the new #iphone upgrade. Won't let me d...
1 7922 currently shitting my fucking pants. #apple #i...
2 7923 I'd like to puts some CD-ROMS on my iPad, is t...
3 7924 My ipod is officially dead. I lost all my pict...
4 7925 Been fighting iTunes all night! I only want th...
test_data['clean_tweet'] = test_data['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: ''.join(ch for ch in x if ch not in set(punctuation)))
test_data['clean_tweet'] = test_data['clean_tweet'].str.lower()
test_data['clean_tweet'] = test_data['clean_tweet'].str.replace("[0-9]", " ", regex=True)
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: ' '.join(x.split()))
test_data.drop(columns=["tweet"],axis=1,inplace=True)
test_data.head()
id clean_tweet
0 7921 i hate the new iphone upgrade. won't let me do...
1 7922 currently shitting my fucking pants. apple ima...
2 7923 i'd like to puts some cdroms on my ipad, is th...
3 7924 my ipod is officially dead. i lost all my pict...
4 7925 been fighting itunes all night i only want the...
df["label"].value_counts()
label
0 5894
1 2026
Name: count, dtype: int64
plt.figure(figsize=(10,8))
sns.countplot(x="label",data=df,palette="gnuplot")
plt.show()
label_counts = df['label'].value_counts()
plt.figure(figsize=(10,8))
plt.pie(label_counts, labels=label_counts.index, autopct='%1.1f%%',
colors=sns.color_palette("gnuplot", len(label_counts)))
plt.title('Label Distribution')
plt.axis("equal")
plt.show()
plt.figure(figsize=(10,8))
plt.pie(label_counts, labels=label_counts.index, autopct='%1.1f%%',
colors=sns.color_palette("brg", len(label_counts)),
wedgeprops=dict(width=0.4))
plt.title('Label Distribution')
plt.axis("equal")
plt.show()
Positive Data Length
plt.figure(figsize=(10,8))
positive_len=df[df["label"]==0]["clean_tweet"].str.len()
plt.hist(positive_len, bins=40, label='Positive Data Length', color="red")
plt.title("Positive Data Length")
plt.legend()
plt.show()
Negative Data Length
plt.figure(figsize=(10,8))
negative_len=df[df["label"]==1]["clean_tweet"].str.len()
plt.hist(negative_len, bins=40, label='Negative Data Length', color="green")
plt.title("Negative Data Length")
plt.legend()
plt.show()
Compare Positive & Negative
data_set = df["clean_tweet"].str.split()
all_words = list(chain.from_iterable(data_set))
counter = Counter(all_words)
common_words = counter.most_common(30)
df_common_words = pd.DataFrame(common_words, columns=['Word', 'Count'])
plt.figure(figsize=(12, 10))
# Explicit palette (the `colors` variable used originally is not defined in this extract)
sns.barplot(x='Count', y='Word', data=df_common_words,
            palette=sns.color_palette("viridis", n_colors=len(df_common_words)))
plt.title('30 Most Common Words')
plt.xlabel('Count')
plt.ylabel('Word')
plt.show()
# Restrict to positive (label 0) tweets so the counts match the plot title
positive_words = list(chain.from_iterable(df[df["label"] == 0]["clean_tweet"].str.split()))
df_common_words = pd.DataFrame(Counter(positive_words).most_common(30), columns=['Word', 'Count'])
plt.figure(figsize=(12, 10))
sns.barplot(x='Count', y='Word', data=df_common_words,
            palette=sns.color_palette("Dark2", n_colors=len(df_common_words)))
plt.title('30 Most Common Positive Words')
plt.xlabel('Count')
plt.ylabel('Word')
plt.show()
# Restrict to negative (label 1) tweets so the counts match the plot title
negative_words = list(chain.from_iterable(df[df["label"] == 1]["clean_tweet"].str.split()))
df_common_words = pd.DataFrame(Counter(negative_words).most_common(30), columns=['Word', 'Count'])
plt.figure(figsize=(12, 8))
sns.barplot(x='Count', y='Word', data=df_common_words,
            palette=sns.color_palette("tab20b", n_colors=len(df_common_words)))
plt.title('30 Most Common Negative Words')
plt.xlabel('Count')
plt.ylabel('Word')
plt.show()
data_set = test_data["clean_tweet"].str.split()
all_words = list(chain.from_iterable(data_set))
counter = Counter(all_words)
common_words = counter.most_common(30)
df_common_words = pd.DataFrame(common_words, columns=['Word', 'Count'])
plt.figure(figsize=(12, 10))
sns.barplot(x='Count', y='Word', data=df_common_words,
            palette=sns.color_palette("viridis", n_colors=len(df_common_words)))
plt.title('30 Most Common Words From Test Data')
plt.xlabel('Count')
plt.ylabel('Word')
plt.show()
Use ELMo Model
What is ELMo?
• No, the ELMo we are referring to isn't the character from Sesame Street! A classic example of the importance of context.
• NLP practitioners worldwide have started using ELMo for a variety of NLP tasks, both in research and in industry. You should check out the original ELMo research paper.
• I don't usually ask people to read research papers, because they can often come across as heavy and complex, but I'm making an exception for ELMo. It is a really clear explanation of how ELMo was designed.
• Well, picture this. You've successfully copied the ELMo code from GitHub into Python and managed to build a model on your custom text data. You get average results, so you need to improve the model. How will you do that if you don't understand the architecture of ELMo? What parameters will you tweak if you haven't studied it?
• This line of thought applies to all machine learning algorithms. You need not get into their derivations, but you should always know enough to play around with them and improve your model. A minimal sketch of how the ELMo embeddings can be extracted follows this list.
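The cells that actually compute the ELMo vectors are not shown in this extract. Below is a minimal sketch of one common pattern using the tensorflow_hub module at https://tfhub.dev/google/elmo/3, whose "default" signature output is a 1,024-dimensional mean-pooled sentence embedding. The module URL, the batching helper, and every name except elmo_test_new (which is referenced later) are assumptions rather than the notebook's original code.
# Sketch (assumed): load the ELMo module from TF Hub and embed the tweets in batches
elmo = hub.load("https://tfhub.dev/google/elmo/3")

def elmo_vectors(texts, batch_size=64):
    vectors = []
    for start in tqdm(range(0, len(texts), batch_size)):
        batch = tf.constant(texts[start:start + batch_size])
        # "default" = mean-pooled contextual word vectors, shape (batch, 1024)
        vectors.append(elmo.signatures["default"](batch)["default"].numpy())
    return np.concatenate(vectors, axis=0)

elmo_train_new = elmo_vectors(df["clean_tweet"].tolist())
elmo_test_new = elmo_vectors(test_data["clean_tweet"].tolist())

# Reshape to (samples, timesteps=1, features=1024) for the LSTM that follows (assumed layout)
elmo_train_new = elmo_train_new.reshape(-1, 1, 1024)
elmo_test_new = elmo_test_new.reshape(-1, 1, 1024)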
label_data=df["label"]
Split Data
from sklearn.model_selection import train_test_split
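The split itself is not shown; a minimal sketch consistent with the roughly 6,336/1,584 train/validation sizes implied by the training log below (an 80/20 split; the random_state is an assumption):
# 80/20 train/validation split of the ELMo features against the labels (random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(
    elmo_train_new, label_data, test_size=0.2, random_state=42)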
# Restore the missing Sequential() and LSTM lines so the model matches the summary below;
# the (1, 1024) input shape is inferred from the 590,336 LSTM parameters (a single timestep of 1,024 ELMo features is assumed).
model = Sequential()
model.add(LSTM(128, input_shape=(1, 1024)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
/usr/local/lib/python3.10/dist-packages/keras/src/layers/rnn/rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳
━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃
Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇
━━━━━━━━━━━━━━━━━┩
│ lstm (LSTM) │ (None, 128) │
590,336 │
├──────────────────────────────────────┼─────────────────────────────┼
─────────────────┤
│ dropout (Dropout) │ (None, 128) │
0 │
├──────────────────────────────────────┼─────────────────────────────┼
─────────────────┤
│ dense (Dense) │ (None, 64) │
8,256 │
├──────────────────────────────────────┼─────────────────────────────┼
─────────────────┤
│ dense_1 (Dense) │ (None, 1) │
65 │
└──────────────────────────────────────┴─────────────────────────────┴
─────────────────┘
Train Model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
Epoch 1/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 5s 5ms/step - accuracy: 0.8416 - loss: 0.3442 - val_accuracy: 0.8832 - val_loss: 0.2624
Epoch 2/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8986 - loss: 0.2413 - val_accuracy: 0.8813 - val_loss: 0.2663
Epoch 3/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.8975 - loss: 0.2346 - val_accuracy: 0.8807 - val_loss: 0.2647
Epoch 4/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9015 - loss: 0.2333 - val_accuracy: 0.8870 - val_loss: 0.2523
Epoch 5/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.9055 - loss: 0.2187 - val_accuracy: 0.8845 - val_loss: 0.2537
Epoch 6/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9098 - loss: 0.2169 - val_accuracy: 0.8939 - val_loss: 0.2471
Epoch 7/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9044 - loss: 0.2226 - val_accuracy: 0.8889 - val_loss: 0.2555
Epoch 8/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9186 - loss: 0.1984 - val_accuracy: 0.8889 - val_loss: 0.2499
Epoch 9/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9118 - loss: 0.2057 - val_accuracy: 0.8895 - val_loss: 0.2556
Epoch 10/10
198/198 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.9138 - loss: 0.1863 - val_accuracy: 0.8864 - val_loss: 0.2593
# Accuracy and loss curves for training vs. validation
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# First subplot
ax[0].plot(history.history['accuracy'], label="Accuracy", color="navy")
ax[0].plot(history.history['val_accuracy'], label="Validation Accuracy", color="green")
ax[0].set_title('Model Accuracy')
ax[0].set_ylabel('Accuracy')
ax[0].set_xlabel('Epoch')
ax[0].legend(loc='best')

# Second subplot
ax[1].plot(history.history['loss'], label="Loss", color="blue")
ax[1].plot(history.history['val_loss'], label="Validation Loss", color="crimson")
ax[1].set_title('Model Loss')
ax[1].set_ylabel('Loss')
ax[1].set_xlabel('Epoch')
ax[1].legend(loc='best')
plt.show()
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
classification_report,
roc_auc_score,
roc_curve,
average_precision_score,
precision_recall_curve,
log_loss
)
label_name=["Positive","Negative"]
y_pred = (model.predict(X_test) > 0.5).astype("int32")

# Compute every metric referenced in the printout below
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary')
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=label_name)

sep = "#" * 90
print(sep)
print("Accuracy:", accuracy)
print(sep)
print("Precision:", precision)
print(sep)
print("Recall:", recall)
print(sep)
print("F1 Score:", f1)
print(sep)
print("Confusion Matrix:\n", conf_matrix)
print(sep)
print("Classification Report:\n", class_report)
print(sep)
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
##########################################################################################
Accuracy: 0.8863636363636364
##########################################################################################
Precision: 0.7727272727272727
##########################################################################################
Recall: 0.8263888888888888
##########################################################################################
F1 Score: 0.7986577181208053
##########################################################################################
Confusion Matrix:
 [[1047  105]
 [  75  357]]
##########################################################################################
Classification Report:
               precision    recall  f1-score   support
##########################################################################################
# Probabilities (not hard labels) are needed for the ROC and precision-recall curves
y_prob = model.predict(X_test).ravel()
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
# Separate names so the scalar precision/recall computed above are not overwritten
precision_curve, recall_curve, _ = precision_recall_curve(y_test, y_prob)

plt.figure(figsize=(10,8))
plt.plot(recall_curve, precision_curve, color='blue', lw=2)
plt.fill_between(recall_curve, precision_curve, color='black', alpha=0.9)
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.grid(True)
plt.show()
# Note: log_loss on hard 0/1 predictions only penalises misclassifications;
# passing the probabilities in y_prob instead would give a more informative value.
logarithm_loss = log_loss(y_test, y_pred)
plt.plot([])
plt.text(0, 0, f'Log Loss: {logarithm_loss:.4f}', fontsize=16, ha='center', va='center', color="black")
plt.axis('off')
plt.show()
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

kappa = cohen_kappa_score(y_test, y_pred)
plt.plot([])
plt.text(0, 0, f'Cohen Kappa Score: {kappa:.4f}', fontsize=16, ha='center', va='center', color="orangered")
plt.axis('off')
plt.show()

# Display the Matthews correlation coefficient the same way as the kappa score
mcc = matthews_corrcoef(y_test, y_pred)
plt.plot([])
plt.text(0, 0, f'Matthews Correlation Coefficient: {mcc:.4f}', fontsize=16, ha='center', va='center', color="teal")
plt.axis('off')
plt.show()
Custom Data Prediction
# Define your custom data (sample)
custom_data = ["This is a great product!", "I did not like this item!"]
prediction = model.predict(elmo_test_new)
prediction = (prediction > 0.5).astype(int)
test_data["predicted_label"] = prediction
test_data["predicted_label"] = test_data["predicted_label"].map({0: "Positive", 1: "Negative"})
test_data.head()
      id                                        clean_tweet predicted_label
0   7921  i hate the new iphone upgrade. won't let me do...        Negative
1   7922  currently shitting my fucking pants. apple ima...        Negative
2   7923  i'd like to puts some cdroms on my ipad, is th...        Negative
3   7924  my ipod is officially dead. i lost all my pict...        Negative
4   7925  been fighting itunes all night i only want the...        Negative