Deep Learning for Hindi Text Classification: A Comparison

Department of Computer Science and Engineering, Indian Institute of Technology Madras
{rbjoshi1309, goyalpoorvi, ravirajoshi}@gmail.com
1 Introduction
Natural language processing (NLP) refers to computational techniques used for processing human language, which can be represented either as text or as speech. NLP based on deep learning has become very popular because of its ability to handle text that is far from grammatically correct. The ability to learn from data has made machine learning systems powerful enough to process any type of unstructured text. Machine learning approaches have been used to achieve state-of-the-art results on NLP tasks such as text classification, machine translation, question answering, text summarization, text ranking, and relation classification.
The focus of our work is text classification of Hindi language. Text classifica-
tion is the most widely used NLP task. It finds application in sentiment analy-
sis, spam detection, email classification, and document classification to name a
ii R. Joshi et al.
word vectors externally and using them in the target task. These approaches
represent transfer learning in the context of NLP. Models like Skip-Thought
Vectors, Universal Sentence Encoder by Google, InferSent, and BERT have been used to learn sentence embeddings. Using pre-trained sentence embeddings lowers the training time and is more robust on small target data sets. In this work, we also evaluate pre-trained multi-lingual sentence embeddings obtained using BERT and LASER to draw a better comparison.
The main contributions of this paper are:
– Compare variations of CNN and LSTM models for Hindi text classification.
– Effectiveness of Hindi fastText word embeddings is evaluated.
– Effectiveness of multi-lingual pre-trained sentence embeddings based on BERT and LASER is evaluated on a Hindi corpus.
2 Related Work
There has been limited literature on Hindi text classification. In his early work [1], Arora used traditional n-gram and weighted n-gram methods for sentiment analysis of Hindi text. Tummalapalli et al. [9] used deep learning techniques, namely basic CNN, LSTM, and multi-input CNN, for evaluating the classification accuracy of Hindi and Telugu texts. Their main focus was capturing morphological variations in the Hindi language using word-level and character-level features. CNN based models performed better as compared to LSTM and SVM using n-gram features. The datasets used were created using translation. In this work, we are concerned with the performance of different model architectures and word vectors, so we do not consider character-level or subword-level features.
In general, there has been a lot of research on text classification and sentiment analysis employing supervised and semi-supervised techniques. Kim [6] proposed a CNN based architecture for classification of English sentences. A simple bag of words model based on averaging of fastText word vectors was proposed in [5], serving as a simple and fast baseline for sentence classification tasks. Usage of RNNs for text classification was introduced in [7], and Bi-LSTM was augmented with simple attention in [10]. Classification results of these models on Hindi text are reported in this work.
Sentence embeddings evaluated in this work include multi-lingual LASER embeddings [2] and multi-lingual BERT based embeddings [4]. LASER uses a Bi-LSTM encoder to generate embeddings whereas BERT is based on the Transformer architecture. LASER takes a neural machine translation approach to learning sentence representations: it builds a sequence-to-sequence model with a Bi-LSTM encoder-decoder architecture, and the encoder Bi-LSTM is used to generate sentence representations. BERT, on the other hand, uses a bi-directional transformer encoder for learning word and sentence representations. It uses a masked language model as the pre-training objective to mitigate the unidirectional training inherent in the simple next-word-prediction language modeling task.
3 Datasets
– TREC question dataset, which involves classifying a question sentence into one of six types. The dataset has a predefined train-test split, with 5452 training samples and 500 testing samples. 10% of the training data was randomly held out for validation.
– Stanford Sentiment Treebank datasets SST-1 and SST-2. SST-1 contains single-sentence movie reviews rated on a five-point sentiment scale. The dataset has a predefined train-test-dev split, with 8544 training samples, 2210 testing samples, and 1101 validation samples. SST-2 is a binary version of SST-1 with only two labels, positive and negative. It has 6920 training samples, 1821 testing samples, and 872 validation samples.
The original English versions of these datasets are translated to Hindi using Google Translate. A language model trained on a Hindi wiki corpus was used to filter out noisy translated sentences. We assume no out-of-vocabulary words, as the fastText model generates word embeddings for unknown words as well. A common vocabulary of 31k words is created and fastText vectors are used to initialize the embedding matrix.
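As an illustration, a minimal sketch of building such an embedding matrix from pre-trained Hindi fastText vectors is given below; the model file name (cc.hi.300.bin) and the vocabulary construction are assumptions and not necessarily the exact pipeline used in this work.

import numpy as np
import fasttext

# Pre-trained Hindi fastText model (assumed file name). Its subword n-grams
# allow it to produce a vector even for words never seen during pre-training.
ft = fasttext.load_model("cc.hi.300.bin")

def build_embedding_matrix(vocab, dim=300):
    # vocab: dict mapping word -> integer index (index 0 reserved for padding).
    matrix = np.zeros((len(vocab) + 1, dim), dtype=np.float32)
    for word, idx in vocab.items():
        matrix[idx] = ft.get_word_vector(word)  # works for unknown words as well
    return matrix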
4 Model Architectures
The data samples consist of a sequence of words, so different sequence processing models are explored in this work. While the most natural sequence processing model is the LSTM, other models are equally applicable as the sequence lengths are short.
– BOW: The bag of words model does not consider the order of words. The word vectors of the input sentence are averaged to get a sentence embedding of size 300. This is followed by a dense layer of size equal to the number of output classes. The softmax output is fed to a cross-entropy loss function and Adam is used as the optimizer.
– BOW + Attention: In this model, instead of simply averaging, a weighted average of the word vectors is taken to generate the sentence embedding. The size of the sentence embedding is 300, and it is followed by a dense layer as in the BOW model. The weight for each time step is learned by passing the corresponding word vector through a linear layer of size 300 × 1. A softmax over these computed weights gives the probabilistic attention scores. This attention approach is described in [10]; a sketch of the mechanism is given after this list.
– CNN: The sequence of word embeddings is passed through three 1-D convolutions with kernel sizes 2, 3, and 4. Each convolution uses 128 filters. The output of each 1-D convolution is max-pooled over time and the results are concatenated to get the sentence representation, which has 384 dimensions. A final dense layer of size equal to the number of output classes follows; a code sketch of this model is given after this list.
– LSTM: The word vectors are passed as input to a two-layer stacked LSTM. The output of the final time step is given as input to a dense layer for classification. The LSTM cell size is 128, and the final time step output, which is treated as the sentence representation, is of size 128.
– Bi-LSTM: The sequence of word embeddings is passed through two stacked bi-directional LSTMs. The output is max-pooled over time and followed by a dense layer of size equal to the number of output classes. The LSTM cell size is 128, and the max-pooled output, which is treated as the sentence representation, is of size 256.
– CNN + Bi-LSTM: The sequence of word embeddings is passed through a 1-D convolution with kernel size 3 and 256 filters. The output is passed through a bi-directional LSTM. The output of the Bi-LSTM is max-pooled over time and followed by a final dense layer.
– Bi-LSTM + Attention: This is similar to the Bi-LSTM model. The difference is that instead of max-pooling over the output of the Bi-LSTM, the attention mechanism described above is employed.
– LASER and BERT: Single pre-trained models for learning multilingual sentence representations, BERT and LASER, were released by Google and Facebook respectively. BERT is a 12-layer transformer based model trained on multilingual data of 104 languages. LASER is a 5-layer Bi-LSTM model pre-trained on multilingual data of 93 languages. Both of these models have Hindi as one of the training languages. The sentence embeddings extracted from these models are used without any fine-tuning or modifications. The pre-trained sentence embeddings are extracted from the corresponding models and passed through a dense layer of 512 units, followed by a dense layer of size equal to the number of output classes over which softmax is computed (see the sketch after this list). BERT generates 768-dimensional embeddings whereas LASER embeddings are 1024-dimensional.
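The CNN model referred to above can be expressed as the following minimal tf.keras sketch; hyper-parameters not stated in the text (maximum sequence length, activation functions) are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, initializers

def build_cnn(num_classes, embedding_matrix, max_len=50):
    vocab_size, dim = embedding_matrix.shape
    inp = layers.Input(shape=(max_len,), dtype="int32")
    emb = layers.Embedding(vocab_size, dim,
                           embeddings_initializer=initializers.Constant(embedding_matrix),
                           trainable=True)(inp)           # fastText initialization
    pooled = []
    for k in (2, 3, 4):                                   # three 1-D convolutions
        conv = layers.Conv1D(filters=128, kernel_size=k, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))  # max-pool over time
    sent = layers.Concatenate()(pooled)                   # 3 x 128 = 384-d sentence vector
    out = layers.Dense(num_classes, activation="softmax")(sent)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model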
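The attention mechanism used in the BOW + Attention and Bi-LSTM + Attention models can be sketched as a small custom layer, assuming the same tf.keras setting as above.

import tensorflow as tf
from tensorflow.keras import layers

class TimeStepAttention(layers.Layer):
    # Scores each time step with a d x 1 linear layer, applies a softmax over
    # time to obtain probabilistic attention weights, and returns the weighted
    # sum of the inputs as the sentence representation.
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(1, use_bias=False)

    def call(self, x):                                   # x: (batch, time, dim)
        weights = tf.nn.softmax(self.score(x), axis=1)   # (batch, time, 1)
        return tf.reduce_sum(weights * x, axis=1)        # (batch, dim)

In the BOW + Attention model this layer is applied directly to the word embeddings, whereas in the Bi-LSTM + Attention model it replaces the max-pooling over the Bi-LSTM outputs.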
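Similarly, the lightweight classifier placed on top of the pre-computed LASER or BERT sentence embeddings is a small feed-forward network; the ReLU activation of the 512-unit layer is an assumption, as the text does not specify it.

import tensorflow as tf
from tensorflow.keras import layers

def build_sentence_embedding_classifier(embedding_dim, num_classes):
    # embedding_dim is 1024 for LASER and 768 for multilingual BERT embeddings.
    inp = layers.Input(shape=(embedding_dim,))
    hidden = layers.Dense(512, activation="relu")(inp)   # 512-unit dense layer
    out = layers.Dense(num_classes, activation="softmax")(hidden)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model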
5 Results

In the reported results, word vectors are indicated as random for random initialization, fast text for trainable fastText initialization, and fast text-static for non-trainable fastText initialization. Out of all the models, the vanilla CNN performs the best on all the datasets. CNNs are known to perform well on short texts, and the same is visible here as the datasets under consideration do not have long sentences. There is only a small difference in the performance of the different LSTM models. However, Bi-LSTM with max-pooling performed better than its attention version and the unidirectional LSTM. The bag of words model with attention fared better than the simple bag of words model. Attention was particularly helpful when static fastText word vectors were used. Stacked CNN-LSTM models were somewhere between the LSTM and CNN based models. We did not see a large drop in performance due to random initialization of word vectors, but the performance across different epochs was more stable with fastText initialization. Finally, as compared to the generic sentence embeddings obtained from BERT and LASER, specific embeddings obtained from custom models performed better. LASER was able to come close to the best performing model. This shows that LASER was able to capture important discriminative features of a sentence required for the task at hand, whereas BERT failed to capture the same.
6 Conclusion
In this work, we compared different deep learning approaches for Hindi sentence classification. The word vectors were initialized either with fastText word vectors trained on a Hindi corpus or with random word vectors. This work therefore also serves as an evaluation of fastText word embeddings for the Hindi sentence classification task. CNN models perform better than LSTM based models on the datasets considered in this paper. Although we would expect BOW to perform the worst, its numbers are comparable to those of LSTM and CNN. Therefore, if accuracy can be traded off for speed, BOW is useful. LSTMs do not do better than CNNs, possibly because word order is relatively free in Hindi. Sentence representations captured by the LASER multilingual model were richer than those from BERT. However, overall, custom models trained on the specific datasets performed better than lightweight models directly utilizing pre-trained sentence encodings. The real advantage of multi-lingual embeddings can be better evaluated on tasks involving text from multiple languages.
References
1. Arora, P.: Sentiment analysis for Hindi language (2013)
2. Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-
shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464 (2018)
3. Conneau, A., Lample, G., Rinott, R., Williams, A., Bowman, S.R., Schwenk, H., Stoyanov, V.: XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053 (2018)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text
classification. arXiv preprint arXiv:1607.01759 (2016)
6. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882 (2014)
7. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text
classification. In: Twenty-ninth AAAI conference on artificial intelligence (2015)
8. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 1532–1543 (2014)
9. Tummalapalli, M., Chinnakotla, M., Mamidi, R.: Towards better sentence classifi-
cation for morphologically rich languages (2018)
10. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based bidirec-
tional long short-term memory networks for relation classification. In: Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Vol-
ume 2: Short Papers). pp. 207–212 (2016)