Recurrent Neural Networks for Sequences: Sentiment Analysis with the IMDb Dataset (.ipynb, Colaboratory)

The document discusses preparing text data for sentiment analysis using recurrent neural networks and LSTMs. It loads 25,000 training and 25,000 testing movie reviews from IMDb, encodes the text as integer indices, pads the sequences to a maximum length of 200 words, and splits the test set into 20,000 test and 5,000 validation samples. It then defines an LSTM model with an embedding layer, an LSTM layer, and a dense output layer, compiles it with binary cross-entropy loss, and fits it to the training data for 50 epochs while monitoring validation accuracy with a TensorBoard callback.


# Dataset to use - Sentiment Analysis (feeling) - a binary problem

# Positive or negative

Recurrent Neural Networks


process sequences of data (e.g., time series, or the words of the sentences in a text)

"Recurrent" - the network contains loops

the output of a given layer becomes part of the input to that same layer at the next step
(a small NumPy sketch of this recurrence appears after these notes)

Time step

the next point in a time series

the next word in the sequence of words for a text sequence

an RNN takes into account the relationships between earlier and later data in a sequence

Text mining applications -

LSTM - Long Short-Term Memory - a recurrent neural network

learning from sequences

predictive text input

sentiment analysis

responding to questions with predicted best answers

inter-language translation

automated video closed captioning - speech recognition

speech synthesis
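To make the recurrence concrete, here is a minimal NumPy sketch (not part of the notebook; the weight shapes and the tanh activation are illustrative assumptions). The hidden state produced at one time step is fed back in as part of the input at the next time step.

import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))  # input-to-hidden weights (4 hidden units, 3 input features)
W_h = rng.normal(size=(4, 4))  # hidden-to-hidden (recurrent) weights
b = np.zeros(4)

sequence = rng.normal(size=(5, 3))  # a toy sequence: 5 time steps, 3 features each
h = np.zeros(4)                     # initial hidden state

for x_t in sequence:
    # the layer's output from the previous step is part of its input at this step
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h)  # final hidden state summarising the whole sequence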

#load the data

25,000 training samples and 25,000 testing samples, each labelled 1 (positive) or 0 (negative)

from tensorflow.keras.datasets import imdb

the dataset contains over 88,000 unique words


specify how many of them to use
10,000 here, because of system memory limitations (even on a GPU or TPU)
more data generally means better models
number_of_words = 10000

# Tensorflow keras and Numpy

import numpy as np

np_load_old = np.load

# temporarily modify the default parameters of np.load so the pickled IMDb data loads

np.load = lambda *a, **k: np_load_old(*a, allow_pickle=True, **k)

(X_train, y_train),(X_test,y_test)=imdb.load_data(num_words=number_of_words)

np.load = np_load_old  # restore the original np.load

# data exploration

check the sample and target dimensions

X_train.shape

(25000,)

X_test.shape

(25000,)

%pprint

Pretty printing has been turned ON

X_train[1]

[..., 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4,
 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64,
 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462,
 33, 89, 78, 285, 16, 145, 95]

y_train[0]

# movie review encoding

# the word-ranking values in the encoded reviews are offset by 3,
# so the most frequently occurring word ('the') is encoded as 4


# indices 0, 1 and 2 are reserved:
# 0 = padding, 1 = start of sequence, 2 = unknown (out-of-vocabulary) word

word_to_index = imdb.get_word_index()
word_to_index['movie']

17

word_to_index['the']

1

word_to_index['wood']

2134

#decoding a movie review

index_to_word={index: word for (word, index) in word_to_index.items()}

# top 50 most frequent words

[index_to_word[i] for i in range(1,51)]

['the',
'and',
'a',
'of',
'to',
'is',
'br',
'in',
'it',
'i',
'this',
'that',
'was',
'as',
'for',
'with',
'movie',
'but',
'film',
'on',
'not',
'you',
'are',
'his',
'have',
'he',
'be',
'one',
'all',
'at',
'by',
'an',
'they',
'who',
'so',
'from',
'like',
'her',
'or',
'just',
'about',
"it's",
'out',
'has',
'if',
'some',
'there',
'what',
'good',
'more']

' '.join([index_to_word.get(i-3,'?') for i in X_train[0]])

'? this film was just brilliant casting location scenery story direction everyone's really suited
the part they played and you could just imagine being there robert ? is an amazing actor and now
the same being director ? father came from the same scottish island as myself so i loved the fact
there was a real connection with this film the witty remarks throughout the film were great it was
just brilliant so much that i bought the film as soon as it was released for ? and would recommend
it to everyone to watch and the fly fishing was amazing really ...

# data preparation

words_per_review=200

from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train=pad_sequences(X_train, maxlen=words_per_review)

X_train.shape

(25000, 200)

X_train[0]

array([ 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35,
480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385,
39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4,
192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920,
4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38,
76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16,
626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106,
5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130,
12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48,
25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5,
14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15,
256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530,
476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104,
88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141,
6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26,
480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92,
25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16,
283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19,
178, 32], dtype=int32)
X_train.ndim

2

X_test= pad_sequences(X_test, maxlen = words_per_review)

X_test.shape

(25000, 200)
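As a quick illustration of what pad_sequences does (a toy example, not from the notebook): by default it zero-pads short sequences at the front and truncates long ones from the front.

from tensorflow.keras.preprocessing.sequence import pad_sequences

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_sequences(toy, maxlen=5))
# [[0 0 1 2 3]
#  [5 6 7 8 9]]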

# splitting the data into validation and test data

Split the 25,000 test samples into 20,000 test samples and 5,000 validation samples. We pass the
validation samples to the model's fit method via its validation_data argument.

from sklearn.model_selection import train_test_split

(X_test, X_val, y_test, y_val)=train_test_split(X_test, y_test, random_state=1, test_size=0.20)

X_test.shape

(20000, 200)

X_test

array([[ 0, 0, 0, ..., 12, 100, 593],
[ 66, 6, 2479, ..., 53, 3454, 438],
[ 0, 0, 0, ..., 26, 131, 642],
...,
[ 13, 286, 252, ..., 30, 38, 642],
[ 46, 7, 2551, ..., 5, 2167, 9435],
[9180, 4, 2, ..., 5, 443, 5698]], dtype=int32)

X_val.shape

(5000, 200)

# creating the neural network

from tensorflow.keras.models import Sequential

rnn=Sequential()

rnn

<keras.engine.sequential.Sequential at 0x7f940691e610>

from tensorflow.keras.layers import Dense, LSTM, Embedding


adding an embedding layer - converts each word index into a dense 128-dimensional vector

rnn.add(Embedding(input_dim=number_of_words, output_dim=128, input_length=words_per_review))

#adding an LSTM layer

units - the number of neurons in the layer.


A common starting point is a value between the length of the sequences (200) and the
number of classes we want to predict (2 - positive or negative).
dropout - the fraction of neurons to randomly disable during training.

rnn.add(LSTM(units=128, dropout=0.2,recurrent_dropout=0))

#Add a dense output layer

- Reduce the LSTM output to one result, positive or negative, so units=1. The activation function to use is sigmoid - for binary classification.

rnn.add(Dense(units=1, activation='sigmoid'))

#compile the model and display the summary of the model

rnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

rnn.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 200, 128) 1280000

lstm_1 (LSTM) (None, 128) 131584

dense_1 (Dense) (None, 1) 129

=================================================================
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________

from tensorflow.keras.callbacks import TensorBoard


import tensorflow as tf

import os, datetime , time

logdir=os.path.join("logs/fit", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

tensorboard_callback= tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:


%reload_ext tensorboard

%tensorboard --logdir logs/fit


Reusing TensorBoard on port 6006 (pid 244), started 1:16:11 ago. (Use '!kill 244' to kill it.)

[TensorBoard SCALARS dashboard: epoch_accuracy chart (y-axis approximately 0.84-0.96, x-axis 0-18 steps) for the training runs]
rnn.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[tensorboard_callback])
[TensorBoard epoch_loss chart and run list for logs/fit (runs 20211129-040330, 20211129-040839, 20211129-050633, 20211129-052455, train and validation)]

Epoch 23/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0076 - accuracy: 0.9974
Epoch 24/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0028 - accuracy: 0.9990
Epoch 25/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0059 - accuracy: 0.9978
Epoch 26/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0038 - accuracy: 0.9990
Epoch 27/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0083 - accuracy: 0.9974
Epoch 28/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0049 - accuracy: 0.9983
Epoch 29/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0039 - accuracy: 0.9988
Epoch 30/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0026 - accuracy: 0.9992
Epoch 31/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0024 - accuracy: 0.9994
Epoch 32/50
782/782 [==============================] - 37s 47ms/step - loss: 0.0075 - accuracy: 0.9977
Epoch 33/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0043 - accuracy: 0.9988
Epoch 34/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0031 - accuracy: 0.9989
Epoch 35/50
782/782 [==============================] - 36s 47ms/step - loss: 0.0023 - accuracy: 0.9993
Epoch 36/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0042 - accuracy: 0.9989
Epoch 37/50
782/782 [==============================] - 37s 47ms/step - loss: 0.0022 - accuracy: 0.9996
Epoch 38/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0031 - accuracy: 0.9990
Epoch 39/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0063 - accuracy: 0.9984
Epoch 40/50
782/782 [==============================] - 36s 46ms/step - loss: 0.0040 - accuracy: 0.9988
Epoch 41/50
782/782 [==============================] - 40s 51ms/step - loss: 7.9363e-04 - accuracy: 0.
Epoch 42/50
782/782 [==============================] - 40s 52ms/step - loss: 0.0016 - accuracy: 0.9994
Epoch 43/50
782/782 [==============================] - 37s 47ms/step - loss: 0.0022 - accuracy: 0.9993
Epoch 44/50
782/782 [==============================] - 36s 47ms/step - loss: 0.0036 - accuracy: 0.9988
Epoch 45/50
782/782 [==============================] - 40s 51ms/step - loss: 0.0019 - accuracy: 0.9995
Epoch 46/50
782/782 [==============================] - 40s 51ms/step - loss: 4.2829e-04 - accuracy: 0.
Epoch 47/50
782/782 [==============================] - 40s 51ms/step - loss: 8.9735e-04 - accuracy: 0.
Epoch 48/50
782/782 [==============================] - 37s 47ms/step - loss: 0.0043 - accuracy: 0.9991
Epoch 49/50
782/782 [==============================] - 37s 47ms/step - loss: 0.0022 - accuracy: 0.9994
Epoch 50/50
782/782 [==============================] - 40s 52ms/step - loss: 4.6063e-04 - accuracy: 1.
<keras.callbacks.History at 0x7f93ff5e3c90>

# training and evaluating the model


results = rnn.evaluate(X_test, y_test)

625/625 [==============================] - 10s 16ms/step - loss: 1.1162 - accuracy: 0.8477

results

[1.1162388324737549, 0.8476999998092651]
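To use the trained model on new text, a review must be encoded the same way as the dataset. The following is a hedged sketch, not part of the original notebook: encode_review is a hypothetical helper, and it assumes the word_to_index dictionary, pad_sequences, and the trained rnn from the cells above.

def encode_review(text, word_to_index, num_words=10000, maxlen=200):
    # hypothetical helper: mimic imdb.load_data's encoding (indices offset by 3,
    # 1 = start of sequence, 2 = unknown / out-of-vocabulary word)
    indices = [1]
    for word in text.lower().split():
        idx = word_to_index.get(word, -1) + 3
        indices.append(idx if 2 < idx < num_words else 2)
    return pad_sequences([indices], maxlen=maxlen)

sample = encode_review("this film was just brilliant and the casting was amazing",
                       word_to_index)
print(rnn.predict(sample))  # a value close to 1 suggests a positive review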

Make-up assignment

1. Add TensorBoard -------[5 marks]


2. Run the code with 10, 20 and 50 epochs - comment on the time and performance (accuracy)-------
[10 marks]
3. Research chatbots and recurrent neural networks. Locate, study and run chatbot examples
implemented with Keras RNNs---------[15 marks]

Run the code with 10, 20 and 50 epochs - comment on the time and performance (accuracy)------- [10
marks]
After running 10, 20 and 50 epochs I realized that the time per epoch decreased slightly as the
number of epochs increased.
The accuracy is higher after running 50 epochs compared to running 10 epochs.
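One way to back this observation with numbers is to time each configuration and record the final validation accuracy. This is a hedged sketch, not from the notebook: build_model is a hypothetical helper that rebuilds the same Embedding -> LSTM -> Dense architecture from scratch for each run, and it reuses X_train, y_train, X_val and y_val from the cells above.

import time

def build_model():
    # hypothetical helper: the same architecture defined earlier in the notebook
    m = Sequential()
    m.add(Embedding(input_dim=number_of_words, output_dim=128,
                    input_length=words_per_review))
    m.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0))
    m.add(Dense(units=1, activation='sigmoid'))
    m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return m

for n_epochs in (10, 20, 50):
    model = build_model()
    start = time.time()
    history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=32,
                        validation_data=(X_val, y_val), verbose=0)
    elapsed = time.time() - start
    print(f"{n_epochs} epochs: {elapsed:.0f}s total, "
          f"val_accuracy={history.history['val_accuracy'][-1]:.4f}")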

NLP
NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart
and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such
as automatic summarization, translation, named entity recognition, relationship extraction, sentiment
analysis, speech recognition, and topic segmentation.

Import necessary libraries

import io
import random
import string # to process standard python strings
import warnings
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings('ignore')

Downloading and installing NLTK
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human
language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as
WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming,
tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

pip install nltk

Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.2.5)


Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from nltk) (1.

Installing NLTK Packages

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('popular', quiet=True) # for downloading packages
#nltk.download('punkt') # first-time use only
#nltk.download('wordnet') # first-time use only

True

Reading in the corpus


For our example, we will be using the Wikipedia page for chatbots as our corpus. Copy the contents of
the page and place them in a text file named 'chatbot.txt'. However, you can use any corpus of your choice.

f=open('chatbot.txt','r',errors = 'ignore')
raw=f.read()
raw = raw.lower()# converts to lowercase

The main issue with text data is that it is all in text format (strings). However, machine learning
algorithms need some sort of numerical feature vector in order to perform their task. So before we start
any NLP project we need to pre-process the text to make it suitable to work with. Basic text pre-processing
includes:

Converting the entire text into uppercase or lowercase, so that the algorithm does not treat the
same words in different cases as different words.

Tokenization: Tokenization is just the term used to describe the process of converting the normal
text strings into a list of tokens, i.e. the words we actually want. A sentence tokenizer can be used to
find the list of sentences and a word tokenizer can be used to find the list of words in strings.
The NLTK data package includes a pre-trained Punkt tokenizer for English.

Removing noise, i.e. everything that isn't a standard number or letter.

Removing stop words. Sometimes, some extremely common words which would appear to be
of little value in helping select documents matching a user need are excluded from the vocabulary
entirely. These words are called stop words.

Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their
stem, base or root form, generally a written word form. For example, if we were to stem the
words "Stems", "Stemming", "Stemmed" and "Stemtization", the result would be the single word
"stem".

Lemmatization: A slight variant of stemming is lemmatization. The major difference between them
is that stemming can often create non-existent words, whereas lemmas are actual words. So your
root stem, meaning the word you end up with, is not necessarily something you can look up in a
dictionary, but you can always look up a lemma. Examples of lemmatization are that "run" is the base
form of words like "running" or "ran", or that "better" and "good" share the same lemma and so are
considered the same (see the short sketch after this list).
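A small sketch of the stemming/lemmatization difference described above (not from the notebook; it assumes the WordNet data is available, which the nltk.download('popular') call later in this document also covers):

import nltk
nltk.download('wordnet', quiet=True)  # data needed by the WordNet lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'ran', 'stems', 'stemming', 'better']
print([stemmer.stem(w) for w in words])                   # stems need not be real words
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # verb lemmas are real words
print(lemmatizer.lemmatize('better', pos='a'))            # 'better' -> 'good' as an adjective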

Tokenisation

sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences


word_tokens = nltk.word_tokenize(raw)# converts to list of words

Preprocessing
We shall now define a function called LemTokens which will take as input the tokens and return
normalized tokens.

lemmer = nltk.stem.WordNetLemmatizer()
#WordNet is a semantically-oriented dictionary of English included in NLTK.

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

Keyword matching
Next, we shall define a function for a greeting by the bot, i.e. if a user's input is a greeting, the bot shall
return a greeting response. ELIZA uses simple keyword matching for greetings. We will utilize the same
concept here.

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey",)


GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

Generating Response
Bag of Words
After the initial preprocessing phase, we need to transform text into a meaningful vector (or array) of
numbers. The bag-of-words is a representation of text that describes the occurrence of words within a
document. It involves two things:

A vocabulary of known words.

A measure of the presence of known words.

Why is it called a “bag” of words? Because any information about the order or structure of
words in the document is discarded; the model is only concerned with whether the known words
occur in the document, not where they occur in the document.

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we
can learn something about the meaning of the document from its content alone.

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize
the text “Learning is great”, we would have the following vector: (1, 1, 0, 0, 1).
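A tiny sketch (not from the notebook) reproducing this example with scikit-learn's CountVectorizer, using the same fixed vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

# fix the vocabulary to {learning, is, the, not, great} in that column order
vectorizer = CountVectorizer(vocabulary=['learning', 'is', 'the', 'not', 'great'])
print(vectorizer.fit_transform(['Learning is great']).toarray())
# [[1 1 0 0 1]] -- word order is discarded; only occurrence counts remain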

TF-IDF Approach
A problem with the Bag of Words approach is that highly frequent words start to dominate in the
document (e.g. larger score), but may not contain as much “informational content”. Also, it will give more
weight to longer documents than shorter documents.

One approach is to rescale the frequency of words by how often they appear in all documents so that the
scores for frequent words like “the” that are also frequent across all documents are penalized. This
approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

Term Frequency: is a scoring of the frequency of the word in the current document.

TF = (Number of times term t appears in a document)/(Number of terms in the document)

Inverse Document Frequency: is a scoring of how rare the word is across documents.

IDF = 1 + log(N/n), where N is the number of documents and n is the number of documents the term t has appeared in.
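A hedged sketch of these two formulas on a toy corpus (not from the notebook; note that scikit-learn's TfidfVectorizer, used later, applies slightly different smoothing than the 1 + log(N/n) form shown here):

import math

docs = [['learning', 'is', 'great'],
        ['learning', 'is', 'fun'],
        ['the', 'movie', 'is', 'great']]

def tf(term, doc):
    # term frequency within a single document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency across the corpus
    n = sum(1 for d in docs if term in d)
    return 1 + math.log(len(docs) / n)

for term in ('is', 'great', 'movie'):
    print(term, round(tf(term, docs[0]) * idf(term, docs), 3))
# 'is' occurs in every document, so its tf-idf in docs[0] is lower than that
# of 'great', even though both appear once in that document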

Cosine Similarity
The tf-idf weight is a weight often used in information retrieval and text mining. It is a statistical
measure used to evaluate how important a word is to a document in a collection or corpus.

Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

where d1, d2 are two non-zero vectors.
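A quick numeric check of this formula (not from the notebook), using the cosine_similarity helper that was imported earlier:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

d1 = np.array([[1, 1, 0, 0, 1]])
d2 = np.array([[1, 1, 0, 0, 0]])

print(cosine_similarity(d1, d2))                              # sklearn helper
print(d1 @ d2.T / (np.linalg.norm(d1) * np.linalg.norm(d2)))  # formula by hand
# both print approximately 0.8165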

To generate a response from our bot for input questions, the concept of document similarity will be used.
We define a function response which searches the user's utterance for one or more known keywords and
returns one of several possible responses. If it doesn't find any input matching the keywords, it
returns the response: "I am sorry! I don't understand you".

def response(user_response):
    robo_response = ''
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if(req_tfidf == 0):
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response

Finally, we will feed in the lines that we want our bot to say while starting and ending a conversation,
depending upon the user's input.

flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while(flag == True):
    user_response = input()
    user_response = user_response.lower()
    if(user_response != 'bye'):
        if(user_response == 'thanks' or user_response == 'thank you'):
            flag = False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_response) != None):
                print("ROBO: " + greeting(user_response))
            else:
                print("ROBO: ", end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("ROBO: Bye! take care..")
ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!
hi
ROBO: *nods*
describe chatbot design
ROBO: design
the chatbot design is the process that defines the interaction between the user and the chatbot

