0% found this document useful (0 votes)
53 views

Text-Based Classification

text base classification

Uploaded by

assad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
53 views

Text-Based Classification

text base classification

Uploaded by

assad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

2019 IEEE 4th International Conference on Computer and Communication Systems

Book Genre Classification Based on Titles with Comparative Machine


Learning Algorithms
Eran Ozsarfati Egemen Sahin
Robert College of Istanbul Nesibe Aydin School
Istanbul, Turkey Kocaeli, Turkey
e-mail: [email protected] e-mail: [email protected]

Can Jozef Saul Alper Yilmaz


Robert College of Istanbul Photogrammetric Computer Vision Laboratory
Istanbul, Turkey The Ohio State University
e-mail: [email protected] Columbus, OH, USA
e-mail: [email protected]

Abstract—This paper presents algorithmic comparisons for approach. Our dataset was retrieved from Amazon library,
producing a book’s genre based on its title. While some titles tokenized with the an algorithm and then converted to vector
are easy to interpret, some are irrelevant to the genre that they representations for every word.
belong to. Henceforth, we seek to determine the optimal and most
accurate method for accomplishing the task. Several data prepro- We present five machine learning algorithms for the men-
cessing steps were implemented, in which word embeddings were tioned task; Recurrent Neural Networks (RNN), Gated Re-
created to make the titles operable by the computer. Five different current Unit (GRU), Long Short-Term Memory (LSTM), Bi-
machine learning models were tested throughout the experiment. Directional LSTM (Bi-LSTM), Convolutional Neural Net-
Each different algorithm was fine-tuned for attaining the best
works (CNN), and Naive Bayes. Each algorithms’ hyper-
parameter values, while no modifications were conducted on the
dataset. The results indicate that the Long Short-Term Memory parameters were tuned and experiments were conducted. The
(LSTM) with a dropout is the top performing architecture among results indicate better performance with deep-learning meth-
the algorithms, with an accuracy of 65.58%. To the authors’ ods, specifically with LSTM, due to their ability to maintain
knowledge, no prior study has been done about book genre memory over long term dependencies. In our study, we explain
classification by title, therefore the present study is the current
our data preprocessing algorithms and experimental setup
best in the field.
for different machine learning algorithms, then examine our
Keywords—Machine Learning; Deep Learning; Long Short- results and end with conclusion and prospective future work.
Term Memory; Genre Classification; Book Title; Natural Lan-
guage Processing II. M ETHODS
I. I NTRODUCTION A. Data Preprocessing
Book titles play an important role in a book’s presentation. The dataset was made publicly available by Akshay
While the title might not always be well-indicative of a book’s Bhatina1 . It contains 207575 samples of data, with each title
topic or genre, it always stores clues in it. The aim of the study corresponding to one of the 32 different genres (Table I).
is to create a model that can determine the genre of a book Subsequently, the data was tokenized and normalized to create
by analyzing the title, as it is the first thing that attracts them, a custom dictionary among the unique words. The inputs were
thus helping stores analyze their book sales. converted to lowercase, and numbers and punctuation were
Multi-class classification experiments have been conducted deleted. English stopwords were then removed from the data
on numerous datasets, such as genre classification of movies using the Natural Language Toolkit (NLTK) stopwords dataset.
based on their titles using convolutional neural networks [1], The resulting data was separated into words and stemming, a
[2]. Gabriel et al. used the cover of movies alongside the title process in which derived words are converted to their roots,
and used only 10 genres. Ertugrul and Karagoz have used the was applied to the data, reducing the vocabulary size without
plot summaries of the movies and have restricted the genres losing information. Word embeddings are vectorized represen-
to only 4 categories. Our study will use a novel field, book tations of words. In our classification experiments, we used the
genres, use information restricted only to the title of the book word2vec algorithm [3], yielding 300-dimensional representa-
and also do it so extensively with 32 different genres. Up tion of words. The algorithm creates 300 dimensional vector
until the publication of these findings, there has not been any
attempt in classifying book-genres using a machine learning 1 https://github.com/akshaybhatia10/Book-Genre-Classification

978-1-7281-1322-7/19/$31.00 ©2019 IEEE 14


representations through context aware deep learning models,
N

where the number of occurrences and adjacent words are used.
The classifiers were fed with equal sized inputs, from which P (y | x1 . . . xn ) ∝ P (y) P (xi | y) (3)
i=1
we found the maximum length in the titles, which was 15,
and filled every title with length below 15 with paddings, In cases where the classification is multi-classed, the rela-
which are non-word representations of null. Then we applied tionship above could be rewritten to find the class y with the
one hot encoding, converting the labels into binary format. maximum probability:
For instance, a title with the class 2 would be represented
as follows 2 = {0, 0, 1, 0, . . . }. Figure 1 visualizes the N

preprocessing steps. y = arg max P (y) P (xi | y) (4)
n
i=1

In spite of their simplified assumptions, Naive Bayes algo-


rithms are quite successful in classification problems, requiring
a small amount of training data to predict the arguments [5].
They also separately calculate the class conditional densities
for each feature, reducing a multidimensional problem to a
single-dimensional estimation.
For text classification problems, two main models of Naive
Bayes are used: Bernoulli Naive Bayes and Multinomial
Figure 1. Data Preprocessing Pipeline Naive Bayes. The Bernoulli model implements the Naive
Bayes algorithm for a dataset distributed according to the
multivariate Bernoulli distributions, treating each feature as a
B. Classification Algorithms binary-valued variable. Multinomial Naive Bayes implements
We will explain the different algorithms and machine the multinomial distribution, implementing the frequency
learning tools used in our experimentations. Some algorithms of words. It also differs from the Bernoulli model by
are more traditional, while some are contemporary deep not penalizing the non-occurrence of features (words) [6].
learning algorithms. 300 dimensional word embeddings were Both models were implemented, and it was recorded that the
uniform throughout the experimentation. The inputs’ (words Multinomial model performed better than the Bernoulli model.
in the titles) corresponding vector representation were used Therefore, the Multinomial model was used throughout the
as input values for respective algorithms. While the models experimentation process.
with memories used concatenation for separate words in a
title, others had the entire title’s vector representation for the 2) Convolutional Neural Networks
input value. Convolutional Neural Networks (CNN) are powerful tools
for recognizing local patterns in data samples [7]. Our con-
1) Naive Bayes volutional neural network architecture performed quite well
The Naive Bayes is a probabilistic method based on the among the experimental architectures, as some word pairs or
Bayesian theorem that is used for classification problems groups of 3, 4 or 5 relate to the genre of the book.
[4]. Naive Bayes is a probabilistic method that is used for The CNN detects local patterns by creating feature maps,
classification problems. The algorithm is based on the Bayes which are created by conducting element wise multiplication
Theorem: with our kernel and the slided area of the input value. Then all
the values are summed, yielding a result for the feature map.
P (B | A) P (A) A 2D depiction of the process is summarized in Figure 2. In
P (A | B) = (1)
P (B) the study the convolution was formed on the 300-dimensional
vector, forming feature maps of the input words.
in which the probability of A occurring is calculated,
given that B has occurred. Given a set of features X =
{x1 , x2 , x3 , . . . xn }, the posterior probability for the class
label yi is constructed amongst the possible classes Y =
{y1 , y2 , y3 , . . . yn } using the theorem:

P (x1 | y)P (x2 | y) . . . P (xn | y) P (y)


P (y | x1 . . . xn ) =
P (x1 )P (x2 ) . . . P (xn )
(2)
The assumption with the theorem, and thus the algorithm, is
that the events (features) are independent of each other. Thus,
the equation could be rewritten as: Figure 2. Feature map formulation depicted in two dimensions

15
TABLE I
G ENRE D ISTRIBUTION IN THE DATASET

Genre Percentile Genre Percentile


Arts & Photography 3.11 Business & Money 4.80
Biographies & Memoirs 2.05 Calendars 1.26
Computers & Technology 3.84 Crafts, Hobbies & Home 4.78
Cookbooks, Food & Wine 4.24 Christian Books & Bibles 4.40
History 3.27 Law 3.52
Humor & Entertainment 3.32 Literature & Fiction 3.65
Parenting & Relationships 1.21 Romance 2.06
Politics & Social Sciences 1.63 Science & Math 4.46
Reference 1.57 Science Fiction & Fantasy 1.83
Religion & Spirituality 3.64 Self-Help 1.30
Gay & Lesbian 0.64 Test Preparation 1.39
Education & Teaching 0.80 Travel 8.83
Sports & Outdoors 2.87 Medical Books 5.82
Teen & Young Adult 3.60 Mystery, Thriller & Suspense 0.96
Engineering & Transportation 1.28 Children’s Books 6.55
Health, Fitness & Dieting 5.72 Comics & Graphic Novels 1.45

Stride can be an additional parameter, modifying the number to dropping out of hidden units (neurons), which can be
of values the network passes between kernel sliding. Addi- visualized as follows:
tionally, the padding function may be used to control the
size of the feature maps by adding zeros around the input
value. After a feature map is captured, a nonlinear function is
applied, converting every negative value to 0 and maintaining
all positive values as they are. The mentioned function is the
Rectified Linear Unit (ReLU) function: y = max(0, value).
Non-linearity is essential as the data can’t merely be described
with linear functions.
Then the pooling function is applied, which comes in varia-
tions of maximum, average, sum pooling. In our architecture, Figure 3. Dropout in Neural Network
max pooling is utilized as it has been found more effective
in previous studies [8]. Max pooling reduces the dimensions We used the Adam optimizer for our network, which enables
of the feature map while maintaining the most important rapid convergence compared to other optimizers [10]. Categor-
identity values through sliding kernels over the rectified feature ical cross entropy function was used as our loss function (6).
map and capturing the highest values. This makes the data Sequentially, filter sizes of 3, 4 and 5 were used with a dropout
more manageable with less parameters as the dimensions are rate of 0.4. The training data was trained with 20 epochs, a
reduced. For the following steps, the output is flattened and batch size of 45 and a learning rate of 0.001. Additionally,
converted to one long vector which will be crucial for the clas- padding was used for better feature detection.
sification algorithms. After this point, a regular feed forward 
back-propagation neural network methodology is applied. A CCE(p, q) = − p(x) log(q(x)) (6)
fully connected layer is applied to the features detected from x
the prior steps, which calculates the probabilities for different 3) Recurrent Neural Network
classes. Softmax function (5) is used to obtain the probability
A Recurrent Neural Network (RNN) is a machine learning
distribution. After obtaining an output, the loss function is
model that has great usage in many Natural Language Pro-
calculated. Finally, the network is back propagated based on
cessing (NLP) tasks. In traditional neural networks, the inputs
the selected optimizer function to adjust the weights.
are independent from each other, while RNNs can understand
sequential information by taking the output of the previous
eWr+b layer as the input[11] . This allows them to have memory about
pc =  L W (5)
i=1 e
ir+bi what has been calculated so far, making them useful in tasks
such as word prediction, where previous words are required in
For our network, we used three convolutional layers and order to make a good prediction. Recurrent Neural Networks
three max pooling layers following the convolution. We added are made up of multiple RNN cells which are connected to
a dropout layer after the softmax function to prevent over- each other. An unfolded RNN is modeled in Figure 4 where
fitting. Dropout is a regularization technique preventing the xt represents the input at time step t, st consists of the
network from memorizing a specific dataset and rather en- hidden layers at time step t, and is calculated based on the
abling it to adapt to variant inputs [9]. The term refers previous output and the input of the current step. st captures

16
information about what happened in all of the previous time sign) activation function is applied to compress the result
steps using the function: st = f (U xt +Wst−1 ) ot is the output between 0 and 1. The update gate helps the model determine
for step t. U , V and W are the parameters, shared across all how much of the past information should be passed on to the
steps. output. The reset gate rt for time step t can be calculated
using the formula:

rt = σ(W (r) xt + U (r) ht−1 ) (9)


This formula is the same as the formula for the update gate,
only with the weights being those for the update gate. The
main idea behind the GRU is that it combines the reset and
update gate into one update gate using the formula:

ht = tanh(W xt + rt · U ht−1 ) (10)


where the input xt is multiplied with a weight Wz and ht−1
is multiplied with a weight Uz . The dot product between rt
Figure 4. An Unfolded RNN model showing the memory cells and U ht−1 is calculated, determining what to remove from
the previous steps. Then, the results from both operations are
In theory, RNNs can capture long term dependencies;
added, and finally the tanh activation function is applied.
however in practice, they fail to do so. The reason RNNs
The same parameters from the RNN model were applied
can only look back a few steps is because of a difficulty
to the GRU model.
called the vanishing gradient problem. This problem arises
from the fact that common activation functions such as the
5) Long Short-Term Memory
hyperbolic tangent (tanh) or sigmoid map the real number
Long short-term memory (LSTM) architectures are more
line onto a range of [0, 1]. This results in a lot of the inputs
complex, advanced versions of RNN. LSTMs show success
being mapped to small values. In these regions, even large
in long term dependencies where the network is ought to
changes to the input produce small changes in the output. This
remember data over a long period for creating more optimized
becomes even worse when stacking multiple layers, resulting
layers [13]. Unlike a regular RNN, which has one gate, there
in an exponentially diminishing gradient.
are four interacting gates with respective functionalities. The
Gated Recurrent Unit and Long Short-Term Memory net-
network takes the previous hidden state and the current input
works are variants of RNNs which are able to solve this
as the input values. The key concept behind LSTM is the
problem.
cell state, which is the horizontal line at the top in Figure
In this study, an RNN was used with 2 layers, the first one
5. The line carries the hidden state, while certain operations
with 128 units and the second with 64. tanh (7) was used
are executed on it through the gated cells. σ represents the
as the activation function, and a dropout rate of 0.5 was used
sigmoid function (14). The X in a circle represents point-
after both layers. The optimizer, RMSprop was used as it is
wise multiplication while the + inside the circle represents
the most efficient optimizer for RNNs. The training data was
point-wise addition. The initial sigma gate is the forget gate,
trained in 50 epochs with a batch size of 25 and a learning
deciding whether to keep information or not. As the sigmoid
rate of 0.001.
function yields binary output, there is a firm decision. Equation
1 − e−2x 11 describes the process:
y= (7)
1 + e−2x
ft = σ(Wf [ht−1 , xt ] + bf ) (11)
4) Gated Recurrent Unit
Gated Recurrent Units (GRU) are variants of RNNs, that where W represents network weights, h represents the
aim to solve the vanishing gradient problem. They do this hidden state value, x represents the current input value, and b
by using update and reset gates [12]. These gates decide what represents the network bias. These mentioned representations
information should be passed on and which ones should not be. will be uniform throughout the following equations that regard
This allows them to keep information from long ago without LSTM. The next section adds new information to the cell state
washing out. The update gate for time step t can be calculated through point multiplication of results coming from the tanh
using the formula: (7) and sigmoid (14) function. The result is added to the cell
state. Equations 12 and 13 describes the process.
zt = σ(W (z) xt + U (z) ht−1 ) (8)
it = σ(Wi [ht−1 , xt ] + bi ) (12)
where xt , the input, is multiplied by its own weight Wz . hz1 ,
the previous output, is also multiplied by its own weight Uz .
Both of the results are added together and a sigmoid (sigmoid C̃t = tanh(Wc [ht−1 , xt ] + bc ) (13)

17
III. R ESULTS AND D ISCUSSION
1
φ(x) = (14) Table II summarizes our quantitative findings from experi-
1 + e−x mentation.
The results of the previously mentioned operations are The results show a gradual improvement in the line of
applied on the cell state through the following equation: RNN variants, with LSTM being the top ranking model. As
mentioned in the previous section, RNNs face the vanishing
gradient problem. The mentioned issue is resolved through
Ct = ft Ct−1 + it C̃t (15) longer cell memories. While the GRU has more filtration
operations and a longer memory than the RNN, LSTMs addi-
The final step yields an output based on the current cell state
tional binary classification before the tanh function increases
alongside with a final filtration. Sigmoid function is point-wise
the complexity. Although there are minor differences between
multiplied with the result of tanh function applied on the cell
GRU and LSTM architectures, there is a 7.3% difference
state. The following equations describe the mentioned process:
between their accuracies, indicating that the LSTM is a more
efficient architecture than GRU. A higher accuracy would
ot = σ(Wo [ht−1 , xt ] + bo ) (16) be an indication of better performance. Seeing as the other
evaluation parameters are quite numerically close to accuracy,
accuracy by itself would be a sufficient criterion for perfor-
ht = ot tanh(Ct ) (17) mance comparison.

TABLE II
E XPERIMENTAL R ESULTS FROM T ESTED M ACHINE L EARNING
A LGORITHMS

Algorithm Accuracy Precision Recall F-score

LSTM 65.58% 64.18% 63.92% 64.05%


GRU 58.28% 59.57% 59.30% 59.43%
RNN 55.91% 54.24% 53.97% 54.10%
CNN 63.10% 59.86% 59.72% 59.79%
Naive Bayes 55.40% 55.40% 55.73% 55.56%
Bi-LSTM 64.28% 64.12% 63.73% 63.92%

Figure 5. LSTM Cell Illustration


Although LSTM and Bi-LSTM conduct similar operations,
6) Bi-Directional LSTM LSTM yielded a higher accuracy due to the short length of
This section will not explain Bidirectional-LSTM from its book titles, unlike other multi-class classification studies [15].
fundamental equations as Bi-LSTM shares the same fun- Because our dataset contained titles only in English language,
damental principles with LSTM, the only difference being the Bi-LSTM was unable take advantage of reverse reading.
the fact that Bi-LSTM reads the input from Wn to Wi as Further research could solve this issue by using Bi-LSTMs in
well as Wi to Wn [14]. Therefore, a Bi-LSTM is comprised datasets which have languages that are read right-to-left, such
of one forward LSTM and one backward LSTM. Bi-LSTM as Hebrew or Arabic. Additionally, for datasets that contain
concatenates the final hidden states (phase before the linear longer titles, Bi-LSTM would be a more suitable architecture
step) and then applies the softmax function to get an output. as there will be more values the network will parse through.
The Bi-LSTM architecture can be summarized with Figure 6. LSTM showed better performance over the other recur-
rent units due to the mathematical operations throughout its
execution. The simple RNN merely contains a memory of
the previous input, whose equation for attaining the memory
can be seen in the methodology section. On the other hand,
equations from 8 to 17 allow the model to reach even to the
initial input for obtaining a result. Additionally, the mentioned
equations are used as decision making tools for elimination
certain output possibilities, further reinforcing the algorithms.
On the other hand, the high performance rate of the CNN
shows its similarity to the LSTM architecture, with its ability
to store information. As kernels with fixed dimensions are used
Figure 6. Bidirectional-LSTM Architecture for finding feature maps, the CNN architecture was able to find
local patterns in titles. Such patterns might be exemplified with

18
correlated word pairs. As kernel sizes of 3x3, 4x4, 5x5 were able to find local patterns. As our input values contained
used for constructing feature maps, the results indicated the samples with a maximum length of 15 and our kernel sizes
existence of relations between multiple words in a title. LSTM included ones with 5 dimension, CNN performed well in our
slightly outperforms CNN as CNN is merely able to detect experimentation as well.
local patterns in a title, while LSTM takes the entire input to
its memory, enabling it to find more vast scaled patterns in IV. C ONCLUSION
titles. In the present study, we present multiple machine learning
While we were able to attain high accuracies for com- algorithms for classifying book genres based on their titles.
paratively similar tasks [15], a fundamental issue lies in our We conducted experiments with different parameters on
dataset: titles being short phrases. As mentioned in the previ- Vanilla Recurrent Neural Networks, Gated Recurrent Unit,
ous section, we had a maximum title length of 15. Another Long Short-Term Memory, Convolutional Neural Networks
factor that made it hard to find patterns in the dataset was the and Naive Bayes. Our results indicate a better performance
high amounts of padding which was present throughout the overall by deep learning architectures (CNN, RNN, LSTM,
input values. GRU, Bi-LSTM). LSTM’s ability to store information for long
Figure 7 summarizes the results from the LSTM architecture term dependencies to be applicable for short inputs like titles
with a confusion matrix. The figure’s x axis shows the made it the most accurate model in our experimentation pool.
predicted values, while the y axis shows the actual values. The Although the accuracies aren’t very high, the paper presents
confusion matrix indicates that political science and thriller the first instance of a book genre classification approach only
books are most likely to be confused. Conversely, books in the based on title. The models at hand can be further improved
Calendar genre are the least likely to be confused, examination with the use of attention mechanism on LSTM’s outputs for
of the dataset shows similar results as all of them contain the detecting the most significant values in an input. Additional
word “calendar” in their title. architectures like Support Vector Machine (SVM) can be
experimented as well.
R EFERENCES
[1] R. C. B. Gabriel S. Simes, Jnatas Wehrmann and D. D. Ruizl. Movie
genre classification with convolutional neural networks. IJCNN, 2016.
[2] Ertugrul, A. M., & Karagoz, P. (2018, January). Movie Genre Classi-
fication from Plot Summaries Using Bidirectional LSTM. In Semantic
Computing (ICSC), 2018 IEEE 12th International Conference on (pp.
248-251). IEEE.
[3] Goldberg, Yoav, and Omer Levy. ”word2vec Explained: deriving Mikolov
et al.’s negative-sampling word-embedding method.” arXiv preprint
arXiv:1402.3722 (2014).
[4] Murphy, K. P. (2006). Naive bayes classifiers. University of British
Columbia, 18.
[5] Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M. A., & Strachan,
R. (2014). Hybrid decision tree and nave Bayes classifiers for multi-class
classification tasks. Expert Systems with Applications, 41(4), 1937-1946.
[6] McCallum, A., & Nigam, K. (1998, July). A comparison of event models
for naive bayes text classification. In AAAI-98 workshop on learning for
text categorization (Vol. 752, No. 1, pp. 41-48).
[7] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classi-
Figure 7. Confusion Matrix yielded from LSTM testing fication with deep convolutional neural networks. In Advances in neural
information processing systems (pp. 1097-1105).
The Naive Bayes algorithm performed relatively well, with [8] Wu, H., & Gu, X. (2015, November). Max-pooling dropout for regular-
an accuracy of 55.40%, even though it is the most simple ization of convolutional neural networks. In International Conference on
Neural Information Processing (pp. 46-54). Springer, Cham.
algorithm among the ones used in the study. The unique [9] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever,
probabilistic approach of the algorithm has been seen to and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent
be effective in previous Natural Language Processing (NLP) neural networks from overfitting. Journal of Machine Learning Research,
15(1):19291958
related studies [16]. [10] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic
The results from deep learning methods yielded the fol- optimization. arXiv preprint arXiv:1412.6980.
lowing order of increasing accuracies: RNN, GRU, CNN, [11] Grossberg, Stephen. ”Recurrent neural networks.” Scholarpedia 8.2
(2013): 1888.
Bi-LSTM and LSTM. In a research with another dataset, [12] Chung, Junyoung, et al. ”Gated feedback recurrent neural networks.”
Microsoft conducted similar experimentation with a different International Conference on Machine Learning. 2015.
dataset [17]. Their research also indicated LSTM’s capabilities [13] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Neural computation, 9(8), 1735-1780.
in long term dependencies and CNN’s ability to recognize [14] Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models
local patterns in phrases. The minor difference between LSTM for sequence tagging. arXiv preprint arXiv:1508.01991.
and CNNs have also been found in another research [18]. [15] Baziotis, C., Athanasiou, N., Paraskevopoulos, G., Ellinas, N., Kolovou,
A., & Potamianos, A. (2018). NTUA-SLP at SemEval-2018 Task 2:
Such a difference originates from LSTM’s ability to maintain Predicting Emojis using RNNs with Context-aware Attention. arXiv
memory over the entire input value, while CNN is merely preprint arXiv:1804.06657.

19
[16] Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998, November).
Inductive learning algorithms and representations for text categorization.
In Proceedings of the seventh international conference on Information
and knowledge management (pp. 148-155). ACM.
[17] B. Athiwaratkun and J. W. Stokes, ”Malware classification with LSTM
and GRU language models and a character-level CNN,” 2017 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), New Orleans, LA, 2017, pp. 2482-2486.
[18] W. Byeon, T. M. Breuel, F. Raue and M. Liwicki, ”Scene labeling with
LSTM recurrent neural networks,” 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3547-
3555.

20

You might also like