Text-Based Classification
Abstract—This paper presents algorithmic comparisons for predicting a book's genre based on its title. While some titles are easy to interpret, others are seemingly irrelevant to the genre that they belong to. Hence, we seek to determine the optimal and most accurate method for accomplishing the task. Several data preprocessing steps were implemented, in which word embeddings were created to make the titles operable by the computer. Five different machine learning models were tested throughout the experiment. Each algorithm was fine-tuned to attain the best parameter values, while no modifications were made to the dataset. The results indicate that the Long Short-Term Memory (LSTM) with a dropout is the top performing architecture among the algorithms, with an accuracy of 65.58%. To the authors' knowledge, no prior study has been done on book genre classification by title; the present study is therefore the current best in the field.

Keywords—Machine Learning; Deep Learning; Long Short-Term Memory; Genre Classification; Book Title; Natural Language Processing

I. INTRODUCTION

Book titles play an important role in a book's presentation. While the title might not always be well-indicative of a book's topic or genre, it always stores clues. The aim of the study is to create a model that can determine the genre of a book by analyzing its title, as the title is the first thing that attracts readers, thus helping stores analyze their book sales.

Multi-class classification experiments have been conducted on numerous datasets, such as genre classification of movies based on their titles using convolutional neural networks [1], [2]. Simões et al. [1] used the covers of movies alongside the titles and only 10 genres. Ertugrul and Karagoz [2] used the plot summaries of the movies and restricted the genres to only 4 categories. Our study addresses a novel field, book genres, uses information restricted only to the title of the book, and does so extensively, with 32 different genres. Up until the publication of these findings, there has not been any attempt at classifying book genres using a machine learning approach. Our dataset was retrieved from Amazon, tokenized, and then converted to vector representations for every word.

We present the following machine learning algorithms for the mentioned task: Recurrent Neural Networks (RNN), Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Bi-Directional LSTM (Bi-LSTM), Convolutional Neural Networks (CNN), and Naive Bayes. Each algorithm's hyperparameters were tuned and experiments were conducted. The results indicate better performance with deep-learning methods, specifically with LSTM, due to its ability to maintain memory over long-term dependencies. In this paper, we explain our data preprocessing steps and the experimental setup for the different machine learning algorithms, then examine our results and end with conclusions and prospective future work.

II. METHODS

A. Data Preprocessing

The dataset was made publicly available by Akshay Bhatia1. It contains 207,575 samples, with each title corresponding to one of 32 different genres (Table I). Subsequently, the data was tokenized and normalized to create a custom dictionary of the unique words. The inputs were converted to lowercase, and numbers and punctuation were deleted. English stopwords were then removed from the data using the Natural Language Toolkit (NLTK) stopwords dataset. The resulting data was separated into words, and stemming, a process in which derived words are converted to their roots, was applied, reducing the vocabulary size without losing information. Word embeddings are vectorized representations of words. In our classification experiments, we used the word2vec algorithm [3], yielding 300-dimensional representations of words. The algorithm creates a 300-dimensional vector for every word in the vocabulary.

1 https://github.com/akshaybhatia10/Book-Genre-Classification
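As an illustration, the following is a minimal sketch of this preprocessing pipeline, assuming NLTK and Gensim are available; the example titles, the clean_title helper and the from-scratch word2vec training are illustrative and not the authors' exact setup.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec

nltk.download("stopwords")                 # NLTK stopword list used for filtering

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_title(title):
    """Lowercase, strip digits/punctuation, drop stopwords, then stem."""
    title = title.lower()
    title = re.sub(r"[^a-z\s]", " ", title)            # remove numbers and punctuation
    tokens = [t for t in title.split() if t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]

titles = ["The Ultimate Guide to Python Programming",
          "A Brief History of Time"]
corpus = [clean_title(t) for t in titles]

# 300-dimensional word2vec embeddings, one vector per vocabulary word
w2v = Word2Vec(sentences=corpus, vector_size=300, min_count=1, window=5)
vector = w2v.wv[corpus[0][0]]              # 300-dimensional vector for the first token
```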
TABLE I
Genre Distribution in the Dataset
Stride can be an additional parameter, modifying the number of values the network passes between kernel slides. Additionally, the padding function may be used to control the size of the feature maps by adding zeros around the input value. After a feature map is captured, a nonlinear function is applied, converting every negative value to 0 and maintaining all positive values as they are. The mentioned function is the Rectified Linear Unit (ReLU) function: y = max(0, value). Non-linearity is essential, as the data cannot merely be described with linear functions.
Then the pooling function is applied, which comes in variations of maximum, average and sum pooling. In our architecture, max pooling is utilized, as it has been found more effective in previous studies [8]. Max pooling reduces the dimensions of the feature map while maintaining the most important identity values, by sliding kernels over the rectified feature map and capturing the highest values. This makes the data more manageable with fewer parameters, as the dimensions are reduced.
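As a toy illustration of these operations, the following NumPy sketch implements a strided, zero-padded 1-D convolution followed by ReLU and max pooling; the input, kernel and window sizes are arbitrary and not taken from the paper's architecture.

```python
import numpy as np

def conv1d(x, kernel, stride=1, pad=0):
    """Valid 1-D convolution with optional zero padding and stride."""
    x = np.pad(x, pad)                         # zeros around the input control the output size
    k = len(kernel)
    out = [np.dot(x[i:i + k], kernel)          # slide the kernel over the input
           for i in range(0, len(x) - k + 1, stride)]
    return np.array(out)

def relu(x):
    return np.maximum(0, x)                    # y = max(0, value)

def max_pool(x, size=2):
    """Keep the largest value in each non-overlapping window."""
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
feature_map = relu(conv1d(x, kernel=np.array([1.0, -1.0, 0.5]), stride=1, pad=1))
pooled = max_pool(feature_map, size=2)         # reduced dimensions, salient values kept
```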
For the following steps, the output is flattened and converted to one long vector, which will be crucial for the classification algorithms. After this point, a regular feed-forward back-propagation neural network methodology is applied. A fully connected layer is applied to the features detected in the prior steps, and it calculates the probabilities for the different classes. The softmax function (5) is used to obtain the probability distribution. After obtaining an output, the loss function is calculated. Finally, the network is back-propagated based on the selected optimizer function to adjust the weights.
p_c = e^(W_c r + b_c) / Σ_{i=1..L} e^(W_i r + b_i)   (5)
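A small NumPy illustration of the softmax in (5); the class scores are invented, and the maximum is subtracted only for numerical stability.

```python
import numpy as np

def softmax(scores):
    """Map raw class scores to a probability distribution that sums to 1."""
    exps = np.exp(scores - np.max(scores))   # subtract the max to keep the exponentials stable
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])           # e.g. W_c r + b_c for three classes
probs = softmax(scores)                      # the largest score receives the highest probability
```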
For our network, we used three convolutional layers and three max pooling layers following the convolutions. We added a dropout layer after the softmax function to prevent overfitting. Dropout is a regularization technique preventing the network from memorizing a specific dataset and rather enabling it to adapt to variant inputs [9]. The term refers to dropping out hidden units (neurons), which can be visualized as follows:

Figure 3. Dropout in Neural Network

We used the Adam optimizer for our network, which enables rapid convergence compared to other optimizers [10]. The categorical cross-entropy function (6) was used as our loss function. Filter sizes of 3, 4 and 5 were used sequentially, with a dropout rate of 0.4. The network was trained for 20 epochs with a batch size of 45 and a learning rate of 0.001. Additionally, padding was used for better feature detection.

CCE(p, q) = − Σ_x p(x) log(q(x))   (6)
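A hedged Keras sketch consistent with the configuration reported above (filter sizes 3, 4 and 5, dropout 0.4, Adam with a 0.001 learning rate, categorical cross-entropy, 20 epochs, batch size 45). The paper does not name its framework, so the library choice, the vocabulary size and the 128 filters per layer are assumptions; dropout is placed before the output layer here, a common variant of the placement described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 15         # maximum title length after padding
NUM_GENRES = 32

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 300),                          # 300-d word vectors
    layers.Conv1D(128, 3, padding="same", activation="relu"),   # filter size 3 (128 filters assumed)
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 4, padding="same", activation="relu"),   # filter size 4
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, padding="same", activation="relu"),   # filter size 5
    layers.MaxPooling1D(2),
    layers.Flatten(),                                           # one long feature vector
    layers.Dropout(0.4),                                        # dropout rate from the paper
    layers.Dense(NUM_GENRES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, batch_size=45)         # training setup from the paper
```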
3) Recurrent Neural Network

A Recurrent Neural Network (RNN) is a machine learning model that has great usage in many Natural Language Processing (NLP) tasks. In traditional neural networks, the inputs are independent from each other, while RNNs can understand sequential information by taking the output of the previous layer as the input [11]. This allows them to have memory about what has been calculated so far, making them useful in tasks such as word prediction, where previous words are required in order to make a good prediction. Recurrent Neural Networks are made up of multiple RNN cells which are connected to each other. An unfolded RNN is modeled in Figure 4, where x_t represents the input at time step t and s_t consists of the hidden layers at time step t, calculated based on the previous output and the input of the current step. s_t captures information about what happened in all of the previous time steps through the function s_t = f(U x_t + W s_{t−1}). o_t is the output for step t. U, V and W are the parameters, shared across all steps.
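A minimal NumPy sketch of a single recurrent step using the update above, with tanh as f; the dimensions and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 300, 64, 32

U = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (shared across steps)
V = rng.normal(size=(output_dim, hidden_dim))  # hidden-to-output weights

def rnn_step(x_t, s_prev):
    """s_t = f(U x_t + W s_{t-1}); o_t is read out from the hidden state."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = V @ s_t
    return s_t, o_t

s = np.zeros(hidden_dim)
for x_t in rng.normal(size=(15, input_dim)):   # e.g. one padded title of 15 word vectors
    s, o = rnn_step(x_t, s)
```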
A sigmoid activation function is applied to compress the result between 0 and 1. The update gate helps the model determine how much of the past information should be passed on to the output. The reset gate r_t for time step t can be calculated using the formula r_t = σ(W_r [h_{t−1}, x_t]).
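For reference, a NumPy sketch of the update and reset gates in the standard GRU formulation; the exact parameterization here (single weight matrices over [h_{t-1}, x_t], no bias) is an assumption rather than the paper's own equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))               # compresses values to the (0, 1) range

rng = np.random.default_rng(2)
input_dim, hidden = 300, 64
W_z = rng.normal(size=(hidden, hidden + input_dim))  # update-gate weights over [h_{t-1}, x_t]
W_r = rng.normal(size=(hidden, hidden + input_dim))  # reset-gate weights over [h_{t-1}, x_t]

h_prev, x_t = rng.normal(size=hidden), rng.normal(size=input_dim)
hx = np.concatenate([h_prev, x_t])

z_t = sigmoid(W_z @ hx)   # update gate: how much past information to carry forward
r_t = sigmoid(W_r @ hx)   # reset gate: how much past information to forget
```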
φ(x) = 1 / (1 + e^(−x))   (14)

The results of the previously mentioned operations are applied on the cell state through the following equation:

C_t = f_t C_{t−1} + i_t C̃_t   (15)
The final step yields an output based on the current cell state, along with a final filtration. The sigmoid function is point-wise multiplied with the result of the tanh function applied on the cell state. The following equations describe the mentioned process:

o_t = σ(W_o [h_{t−1}, x_t] + b_o)   (16)

h_t = o_t tanh(C_t)   (17)
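A NumPy sketch of equations (14) to (17); the forget gate f_t, input gate i_t and candidate C̃_t are assumed to have been produced by the earlier gate computations, and all dimensions are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                  # equation (14)

rng = np.random.default_rng(1)
input_dim, hidden = 300, 64
W_o = rng.normal(size=(hidden, hidden + input_dim))  # output-gate weights over [h_{t-1}, x_t]
b_o = np.zeros(hidden)

# f_t, i_t and the candidate value are assumed to come from the earlier gate computations
f_t, i_t = rng.uniform(size=hidden), rng.uniform(size=hidden)
C_prev, C_tilde = rng.normal(size=hidden), np.tanh(rng.normal(size=hidden))
h_prev, x_t = rng.normal(size=hidden), rng.normal(size=input_dim)

C_t = f_t * C_prev + i_t * C_tilde                        # equation (15)
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # equation (16)
h_t = o_t * np.tanh(C_t)                                  # equation (17)
```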
III. RESULTS AND DISCUSSION

Table II summarizes our quantitative findings from the experimentation.

The results show a gradual improvement across the RNN variants, with LSTM being the top-ranking model. As mentioned in the previous section, RNNs face the vanishing gradient problem. The mentioned issue is resolved through longer cell memories. While the GRU has more filtration operations and a longer memory than the RNN, the LSTM's additional binary classification before the tanh function increases the complexity. Although there are minor differences between the GRU and LSTM architectures, there is a 7.3% difference between their accuracies, indicating that the LSTM is a more efficient architecture than the GRU. A higher accuracy is an indication of better performance. Seeing as the other evaluation parameters are numerically close to accuracy, accuracy by itself is a sufficient criterion for performance comparison.

TABLE II
Experimental Results from Tested Machine Learning Algorithms
correlated word pairs. As kernel sizes of 3x3, 4x4 and 5x5 were used for constructing feature maps, the results indicated the existence of relations between multiple words in a title. LSTM slightly outperforms CNN, as CNN is merely able to detect local patterns in a title, while LSTM takes the entire input into its memory, enabling it to find larger-scale patterns in titles.

While we were able to attain high accuracies for comparatively similar tasks [15], a fundamental issue lies in our dataset: titles are short phrases. As mentioned in the previous section, we had a maximum title length of 15. Another factor that made it hard to find patterns in the dataset was the high amount of padding present throughout the input values.
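To make the padding issue concrete, a small sketch assuming a pad index of 0 and the length-15 limit mentioned above; the token ids are invented.

```python
def pad_title(token_ids, max_len=15, pad_id=0):
    """Right-pad a tokenized title to the fixed input length."""
    return token_ids[:max_len] + [pad_id] * (max_len - len(token_ids))

title_ids = [412, 87, 3051]                 # a three-word title after preprocessing (ids are made up)
padded = pad_title(title_ids)               # 3 real tokens followed by 12 pad tokens
pad_ratio = padded.count(0) / len(padded)   # 0.8 of this input is padding
```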
Figure 7 summarizes the results from the LSTM architecture with a confusion matrix. The figure's x axis shows the predicted values, while the y axis shows the actual values. The confusion matrix indicates that political science and thriller books are the most likely to be confused. Conversely, books in the Calendar genre are the least likely to be confused; examination of the dataset shows why, as all of them contain the word "calendar" in their title.

Figure 7. Confusion Matrix yielded from LSTM testing
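A sketch of how such a confusion matrix can be produced with scikit-learn; y_true and y_pred stand in for the test labels and the LSTM's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# stand-ins for the 32-class test labels and the model's predictions
y_true = np.array([0, 3, 3, 7, 1, 0])
y_pred = np.array([0, 3, 1, 7, 1, 3])

cm = confusion_matrix(y_true, y_pred)   # rows: actual genre, columns: predicted genre
```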
The Naive Bayes algorithm performed relatively well, with an accuracy of 55.40%, even though it is the simplest algorithm among the ones used in the study. The probabilistic approach of the algorithm has been shown to be effective in previous Natural Language Processing (NLP) related studies [16].
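A sketch of a comparable Naive Bayes baseline with scikit-learn; the paper does not specify its implementation, so the multinomial variant over bag-of-words counts, as well as the example titles and labels, are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = ["wall street journal daily calendar",
          "the silent patient a thriller"]
genres = ["Calendars", "Thrillers"]     # illustrative labels, not the dataset's exact names

nb = make_pipeline(CountVectorizer(), MultinomialNB())   # word counts fed to multinomial NB
nb.fit(titles, genres)
print(nb.predict(["pocket desk calendar"]))
```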
The results from the deep learning methods yielded the following order of increasing accuracy: RNN, GRU, CNN, Bi-LSTM and LSTM. Microsoft conducted similar experimentation with a different dataset [17]. Their research also indicated LSTM's capability with long-term dependencies and CNN's ability to recognize local patterns in phrases. The minor difference between LSTM and CNN has also been found in other research [18]. Such a difference originates from LSTM's ability to maintain memory over the entire input value, while CNN is merely able to find local patterns. As our input values contained samples with a maximum length of 15 and our kernel sizes included ones of dimension 5, CNN performed well in our experimentation as well.

IV. CONCLUSION

In the present study, we present multiple machine learning algorithms for classifying book genres based on their titles. We conducted experiments with different parameters on vanilla Recurrent Neural Networks, Gated Recurrent Unit, Long Short-Term Memory, Convolutional Neural Networks and Naive Bayes. Our results indicate better performance overall by the deep learning architectures (CNN, RNN, LSTM, GRU, Bi-LSTM). LSTM's ability to store information over long-term dependencies, which proved applicable even for short inputs like titles, made it the most accurate model in our experimentation pool. Although the accuracies are not very high, the paper presents the first instance of a book genre classification approach based only on the title. The models at hand can be further improved with the use of an attention mechanism on the LSTM's outputs for detecting the most significant values in an input. Additional architectures such as the Support Vector Machine (SVM) can be experimented with as well.
REFERENCES
[1] G. S. Simões, J. Wehrmann, R. C. Barros, and D. D. Ruiz. Movie genre classification with convolutional neural networks. IJCNN, 2016.
[2] Ertugrul, A. M., & Karagoz, P. (2018, January). Movie Genre Classification from Plot Summaries Using Bidirectional LSTM. In Semantic Computing (ICSC), 2018 IEEE 12th International Conference on (pp. 248-251). IEEE.
[3] Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).
[4] Murphy, K. P. (2006). Naive Bayes classifiers. University of British Columbia, 18.
[5] Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M. A., & Strachan, R. (2014). Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks. Expert Systems with Applications, 41(4), 1937-1946.
[6] McCallum, A., & Nigam, K. (1998, July). A comparison of event models for naive Bayes text classification. In AAAI-98 workshop on learning for text categorization (Vol. 752, No. 1, pp. 41-48).
[7] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[8] Wu, H., & Gu, X. (2015, November). Max-pooling dropout for regularization of convolutional neural networks. In International Conference on Neural Information Processing (pp. 46-54). Springer, Cham.
[9] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.
[10] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] Grossberg, Stephen. "Recurrent neural networks." Scholarpedia 8.2 (2013): 1888.
[12] Chung, Junyoung, et al. "Gated feedback recurrent neural networks." International Conference on Machine Learning. 2015.
[13] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[14] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
[15] Baziotis, C., Athanasiou, N., Paraskevopoulos, G., Ellinas, N., Kolovou, A., & Potamianos, A. (2018). NTUA-SLP at SemEval-2018 Task 2: Predicting Emojis using RNNs with Context-aware Attention. arXiv preprint arXiv:1804.06657.
[16] Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998, November).
Inductive learning algorithms and representations for text categorization.
In Proceedings of the seventh international conference on Information
and knowledge management (pp. 148-155). ACM.
[17] B. Athiwaratkun and J. W. Stokes, ”Malware classification with LSTM
and GRU language models and a character-level CNN,” 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), New Orleans, LA, 2017, pp. 2482-2486.
[18] W. Byeon, T. M. Breuel, F. Raue and M. Liwicki, ”Scene labeling with
LSTM recurrent neural networks,” 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3547-
3555.