Report On Text Classification Using CNN, RNN & HAN
Akshat Maheshwari
Jul 17, 2018
Introduction
Hello World!! I recently joined Jatana.ai as an NLP Researcher (Intern 😇) and I was
asked to work on text classification use cases using deep learning models.
In this article I will share my experiences and learnings while experimenting with
various neural network architectures.
Text classification was performed on datasets in Danish, Italian, German, English
and Turkish.
The goal of text classification is to automatically classify text documents into one
or more predefined categories.
Text classification is a very active research area, both in academia 📚 and in
industry. In this post, I will try to present a few different approaches and compare
their performances; the implementations are based on Keras.
All the source code and the results of the experiments can be found in the
jatana_research repository.
3. Labels: These are the predefined categories/classes that our model will predict.
4. ML Algo: This is the algorithm through which our model is able to deal with text
classification (in our case: CNN, RNN, HAN).
5. Predictive Model: A model trained on the historical dataset that can perform
label predictions.
Text Classification Using Convolutional Neural Network (CNN)
I have taken reference from Yoon Kim's paper and this blog by Denny Britz.
CNNs are generally used in computer vision; however, they've recently been applied to
various NLP tasks, and the results were promising 🙌 .
Let's briefly see, through a diagram, what happens when we use a CNN on text data.
The result of each convolution will fire when a special pattern is detected. By varying
the size of the kernels and concatenating their outputs, you allow the network to detect
patterns of multiple sizes (e.g. 2, 3, or 5 adjacent words). Patterns could be expressions
(word n-grams) like "I hate" or "very good", and CNNs can therefore identify them in the
sentence regardless of their position; a small sketch of this idea follows the image
reference below.
Image Reference : http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
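To make the idea concrete, here is a minimal sketch (not the architecture used later in this post) of parallel Conv1D branches with kernel sizes 2, 3 and 5 whose pooled outputs are concatenated; all sizes and names here are illustrative assumptions:

from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Dense
from keras.models import Model

VOCAB_SIZE, EMB_DIM, SEQ_LEN = 20000, 100, 100  # illustrative constants

inp = Input(shape=(SEQ_LEN,), dtype='int32')
emb = Embedding(VOCAB_SIZE, EMB_DIM)(inp)
# One branch per kernel size: each detects n-grams of that size anywhere in the text
branches = [GlobalMaxPooling1D()(Conv1D(64, k, activation='relu')(emb))
            for k in (2, 3, 5)]
merged = Concatenate()(branches)              # combine pattern detectors of all sizes
out = Dense(1, activation='sigmoid')(merged)  # binary head, just for illustration
model = Model(inp, out)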
In this section, I have used a simplified CNN to build a classifier. First, we use Beautiful
Soup to remove HTML tags and unwanted characters from the text.
import re
from bs4 import BeautifulSoup

def clean_str(string):
    # Remove backslashes and quote characters, then lowercase
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

texts = []
labels = []

for i in range(df.message.shape[0]):
    # Strip HTML tags from each message before cleaning
    text = BeautifulSoup(df.message[i], "html.parser")
    texts.append(clean_str(text.get_text()))

for i in df['class']:
    labels.append(i)
Here I have used the pre-trained GloVe 6B 100d word vectors (official documentation:
https://nlp.stanford.edu/projects/glove/). For an unknown word, the code will just
randomise its vector. Below is a very simple convolutional architecture, using a total of
128 filters of size 5 and max pooling of 5 and 35, following the sample from this blog.
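The code itself was embedded as an image in the original post, so what follows is a minimal sketch under stated assumptions: texts have already been fitted with Keras' Tokenizer (as shown later in this post, so word_index exists), glove.6B.100d.txt sits in the working directory, MAX_SEQUENCE_LENGTH is 1000 (the pooling sizes 5, 5 and 35 assume this), and macronum is the list of class labels used elsewhere in the post.

import numpy as np
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import Model

EMBEDDING_DIM = 100
MAX_SEQUENCE_LENGTH = 1000  # assumed; 1000 shrinks to 1 with the pooling below

# Load GloVe vectors into a dict: word -> 100-d vector
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Start from random vectors, so unknown words simply keep a random embedding
word_index = tokenizer.word_index
embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH, trainable=True)

# Simple CNN: 128 filters of size 5, max pooling of 5 and 35
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
x = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # collapses the remaining 35 steps to 1
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(macronum), activation='softmax')(x)
model = Model(sequence_input, preds)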
Text Classification Using Recurrent Neural Network (RNN)
Using the knowledge from an external embedding can enhance the precision of your
RNN, because it integrates new (lexical and semantic) information about the words,
information that has been trained and distilled on a very large corpus of data. The
pre-trained embedding we'll be using is GloVe.
RNNs may look scary 😱 . Although they're complex to understand, they're quite
interesting. They encapsulate a very beautiful design that overcomes traditional
neural networks' shortcomings when dealing with sequence data: text, time series,
videos, DNA sequences, etc.
An RNN is a sequence of neural network blocks that are linked to each other like a
chain, each one passing a message to its successor. Again, if you want to dive into the
internal mechanics, I highly recommend Colah's blog.
Image Reference : http://colah.github.io/posts/2015-08-Understanding-LSTMs/
The same preprocessing is also done here using Beautiful Soup. We will process text
data, which is a sequence type where the order of words is very important to the
meaning. Hopefully RNNs take care of this and can capture long-term dependencies.
To use Keras on text data, we first have to preprocess it. For this, we can use Keras’
Tokenizer class. This object takes as argument num_words which is the maximum
number of words kept after tokenization based on their word frequency.
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
Once the tokenizer is fitted on the data, we can use it to convert text strings to
sequences of numbers. These numbers represent the position of each word in the
dictionary (think of it as a mapping).
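A minimal sketch of this step (MAX_SEQUENCE_LENGTH, the padded sequence length, is an assumption):

from keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(texts)              # strings -> lists of word indices
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # pad/truncate to a fixed length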
Text Classification Using Hierarchical Attention Network (HAN)
In this section, I will try to tackle the problem by using a recurrent neural network
and an attention-based LSTM encoder.
By using an LSTM encoder, we intend to encode all the information of the text in the
last output of the recurrent neural network before running a feed-forward network
for classification.
I'm using the LSTM layer in Keras to implement this. Besides the forward LSTM, I
have used a bidirectional LSTM and concatenated the last outputs of both LSTMs.
Keras provides a very nice wrapper called Bidirectional, which makes this coding
exercise effortless; a sample is sketched below.
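Since the linked sample code is not reproduced in this post, here is a minimal self-contained sketch of the wrapper (all shapes are illustrative assumptions). With the default merge_mode='concat', it concatenates the last outputs of the forward and backward LSTMs:

from keras.layers import Input, Embedding, LSTM, Bidirectional
from keras.models import Model

seq_in = Input(shape=(100,), dtype='int32')   # assumed sequence length
emb = Embedding(20000, 100)(seq_in)           # placeholder vocab size and dim
# Forward and backward LSTMs over the same sequence; their last outputs are
# concatenated into a single 200-d vector
bi = Bidirectional(LSTM(100), merge_mode='concat')(emb)
encoder = Model(seq_in, bi)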
import numpy as np

# MAX_SENTS (sentences per document) and MAX_SENT_LENGTH (words per sentence)
# are assumed to be defined constants
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)

data = np.zeros((len(texts), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
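The loop that fills data is not shown in the post; here is a sketch of one way to do it, assuming sentences are split with NLTK's sent_tokenize and words are indexed with the fitted tokenizer:

from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt')
from keras.preprocessing.text import text_to_word_sequence

for i, text in enumerate(texts):
    for j, sent in enumerate(sent_tokenize(text)[:MAX_SENTS]):
        for k, word in enumerate(text_to_word_sequence(sent)[:MAX_SENT_LENGTH]):
            idx = tokenizer.word_index.get(word)
            if idx is not None and idx < MAX_NB_WORDS:  # keep only the top words
                data[i, j, k] = idx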
After this, we can use the Keras TimeDistributed wrapper to construct the
hierarchical input layers as follows. We can also refer to this post.
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH,
                            trainable=True)
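# `sentEncoder`, used below, is not shown in the original post; this is a
# minimal sketch of the missing word-level encoder, assuming a bidirectional
# LSTM over the embedded words of a single sentence (as in the HAN paper):
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)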
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
review_encoder = TimeDistributed(sentEncoder)(review_input)  # encode each sentence
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)       # sentence-level BiLSTM
preds = Dense(len(macronum), activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)
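To round the sketch off, the model can then be compiled and trained in the usual Keras way (the optimiser, epochs and batch size below are assumptions, and labels is assumed to be one-hot encoded):

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(data, labels, validation_split=0.2, epochs=10, batch_size=50)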
Results
Here are the plots for Accuracy 📈 and Loss 📉
Observations 👇 :
Based on the above plots, CNN has achieved good validation accuracy with high
consistency; RNN and HAN have also achieved high accuracy, but they are not as
consistent across all the datasets.
RNN was found to be the worst architecture to implement for production-ready
scenarios.
The CNN model outperformed the other two models (RNN and HAN) in terms of
training time; however, HAN can perform better than CNN and RNN if we have a
huge dataset.
For dataset 1 and dataset 2, where there are more training samples, HAN achieved
the best validation accuracy, while when the training samples are very low (dataset
3), HAN did not perform that well.
When training samples are fewer (dataset 3), CNN achieved the best validation
accuracy.
Performance Improvements :
To achieve the best performance 😉, we may:
Infrastructure setup:
All the above experiments were performed on an 8-core vCPU machine with an
Nvidia Tesla K80 GPU.
Further, all the experiments were performed under the guidance of Rahul Kumar 😎.
I would also like to thank Jatana.ai for providing very good infrastructure and
full support throughout my journey 😃.