Book Genre Classification Using ML
Abstract
Artificial intelligence has become the newest data science powerhouse of the modern era. Since its inception, the use of Machine Learning, Deep Learning, and Computer Vision algorithms in data analytics has gained popularity. However, the use of Logistic Regression, Support Vector Machine, Naïve Bayes Multinomial, Multilayer Perceptron, and Convolutional Neural Network models to classify genre, and their performance on Bangla book text, has not yet been investigated. Hence, in this work we propose several machine learning and neural network-based models to identify six genre categories of Bengali book data. We evaluate the overall accuracy of Logistic Regression, SVM, Naïve Bayes Multinomial, MLP, and CNN models. Of all of them, SVM and Naïve Bayes Multinomial attained the highest accuracy.
Declaration
I hereby declare that the work presented in this Thesis is the outcome of an investigation performed by me under the supervision of Tamanna Haque Nipa, Assistant Professor, Department of Computer Science & Engineering, Stamford University Bangladesh. I also declare that no part of this Thesis has been or is being submitted elsewhere for the award of any degree or diploma.
...........................................
Ahmed Shafi Arnob
Date:
...........................................
Tamanna Haque Nipa
Date:
Dedicated to ...
I would like to dedicate this thesis to my beloved parents and teachers.
Acknowledgments
We would like to begin by expressing our gratitude to our thesis supervisor, Tamanna Haque Nipa, Assistant Professor, Department of Computer Science and Engineering, Stamford University Bangladesh. Throughout our research and writing process, she consistently provided us with valuable guidance and support. Her office was always open to address any queries or challenges we encountered. We sincerely appreciate her mentorship and the way she allowed us to take ownership of this paper while guiding us in the right direction. In this thesis, she was a never-ending source of inspiration, motivation, and encouragement. Furthermore, we are deeply grateful to the faculty, friends, and family members associated with Stamford University Bangladesh. Their influence, support, and encouragement have played a vital role in our journey. We acknowledge that without their appreciation, guidance, and assistance, this accomplishment would not have been possible. They have been a constant source of inspiration and motivation throughout this thesis endeavor. In conclusion, we sincerely thank everyone who has contributed to our research and thesis, ensuring its successful completion.
Table of Contents

Abstract
Acknowledgments
1: Introduction
    1.1 Problem Statement
    1.2 Motivation
    1.3 Aim of Study
    1.4 Thesis Contributions
    1.5 Thesis Outline
2: Literature Review
    2.1 Related Works
        2.1.1 In Other Languages
        2.1.2 In Bangla Language
    2.2 Challenges
3: Methodology
    3.1 Data Collection
    3.2 Data Insight
    3.3 Data Pre-Processing
    3.4 Feature Extraction
        3.4.1 Term Frequency - Inverse Document Frequency (TF-IDF)
        3.4.2 Bag of Words (BoW)
    3.5 Classification
        3.5.1 Naïve Bayes Multinomial
        3.5.2 Support Vector Machine
        3.5.3 Logistic Regression
        3.5.4 Multi-layer Perceptron (MLP)
        3.5.5 CNN
    3.6 Evaluation Metrics
        3.6.1 Accuracy
        3.6.2 Confusion Matrix
        3.6.3 Precision
        3.6.4 Recall
        3.6.5 F1-Score
4: Experimental Evaluation
    4.1 Findings
        4.1.1 Logistic Regression
        4.1.2 Support Vector Machine
        4.1.3 Naïve Bayes Multinomial
        4.1.4 MLP
        4.1.5 CNN
        4.1.6 Result Summary
    4.2 Comparing with Similar Papers
5: Conclusion
    5.1 Conclusion
    5.2 Limitations
    5.3 Future Work
References
1 Introduction
As the world becomes increasingly reliant on technology, the fusion of artificial and human brains has pushed the boundaries of cognitive capabilities. This synergy has given rise to Artificial Intelligence (AI), a concept that has gained immense popularity globally. Within the realm of AI, Natural Language Processing (NLP) holds particular prominence, especially in the field of text classification research. NLP serves as a prominent method for searching, analyzing, comprehending, and deriving insights from text-based data.

The diversity of human languages further complicates matters, with individuals expressing themselves in their native tongues such as Spanish, Bangla, Chinese, and English. However, it falls upon computers to dissect and interpret the meanings within these texts. NLP empowers computers to extract meaningful information with greater sophistication. Over recent years, the prevalence of NLP has surged. This technique finds application in various domains including text classification, information extraction, speech tagging, and more.
The number of people using smartphones and other digital devices is rising quickly, and so is the number of people using the internet [1]. As a result, consumption of digital content has increased substantially in recent years, and reading books is no exception. Many readers have moved away from printed books; nowadays people read books on their smartphones or on eReaders like the Amazon Kindle. Books in all languages are read on these devices, and Bangla is no exception. Due to the growth of Bangla book reading on digital devices, publishers now publish their books digitally so that readers can easily read them on their favourite devices. A large number of eBooks are published daily. The extensive and growing electronic availability of Bangla text documents therefore increases the need for automatic methods to analyze the content of those documents and serve eBook readers well. If books are categorized according to their appropriate genre, searching and retrieving information becomes quick and efficient. We can also recommend books to readers based on the genre of the books they are reading. Text classification has other uses besides book classification, such as email filtering, spam detection, and sentiment analysis.
In the field of text categorization in languages other than English, there is often a scarcity of large datasets for training models, which hampers the achievement of superior results. This is especially true for the Bengali language, where we found a lack of datasets relevant to Bengali books. As a result, we set about creating our own dataset to support our research.

Among NLP experts, supervised classification algorithms have been a common choice for problems involving document classification [2][3]. Furthermore, recent work on text classification using neural networks (NN) with word vectors has produced outstanding results [4]. In this study, we conduct a thorough examination of the difficulties associated with classifying textual data in Bengali. Our dataset has 118 examples from six different genres. We used Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW) as feature extraction techniques, aiming to classify Bangla electronic text documents. We then ran comparison experiments on numerous models, including Logistic Regression, Naïve Bayes Multinomial, Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Networks (CNN).
1.1 Problem Statement

Text classification provides a broad view of the system. Among other things, it is quite helpful for filtering emails and for managing Web content, search engines, and more. Technological development has increased interest in text categorization problems. On the internet, there are many Bangla language documents that are both valuable and challenging to adequately categorize into their corresponding semantic categories. If documents are organized into their appropriate categories, searching and information retrieval become quick and simple. Furthermore, a reader prefers to read on screen the books that interest him or her most; readers are therefore most likely to be interested in receiving books from their preferred categories. Consumers expect customized selections of books appropriate for them to be prominently displayed on the initial pages. This type of work is carried out in a variety of worldwide book-reading apps. Reading a whole book to detect its genre is labour-intensive and not a productive use of time. Thus, text categorization is a task with both commercial and labor-saving implications. In the text categorization field, extensive research on many languages has been performed; in Bangla, however, not much research has been done, although the Bangla language has an extensive heritage and is one of the most spoken languages in the world. Native speakers of Bangla comprise approximately 8% of the world population [5], and in terms of population, Bangla is the world's fourth most popular language [6]. Thus, it is important to automatically arrange and categorize Bengali books so that users may conveniently find relevant information.
1.2 Motivation
As the volume of textual data has exploded in recent years, classifying documents using automated techniques has emerged as a crucial job in the field of natural language processing. Application is a crucial consideration when choosing a method for automatically classifying Bangla text documents, since it is required to distinguish between text data from many domains in many different areas of expertise, including consumer goods, law, healthcare, and education, to name just a few. In addition, text data in many fields is growing as native Bengalis use the internet more often. Beyond applicability, only a small number of studies have used well-known supervised algorithms in their selection of algorithms for the task [7]. There is no established comparison of all the well-known and widely applied classification methods. Additionally, even though the majority of these studies make use of sizable datasets, there may be fewer training examples in brand-new real-world initiatives, especially in the Bangla language. Therefore, performance evaluation on small datasets using current approaches should be understood to be just as important as on huge datasets [8].
1.3 Aim of Study

A technological revolution is currently taking place. In the digital age we live in today, technology is omnipresent. We must advance Bengali in the field of Natural Language Processing (NLP) in order to compete with other languages that are excelling in this area. We aim to create a dataset of Bangla books and categorize Bangla book text into its respective genre using machine learning algorithms and neural networks. Future researchers will have more room to work with our dataset thanks to this work, which also provides a quick overview of the effectiveness of the algorithms we employed on the dataset of Bangla book texts. To compare which supervised learning approach performs better, we used four performance metrics: accuracy, precision, recall, and F1-score. The performance of the algorithms applied to the classification of Bangla book text is shown in our work, and our experiments were conducted on 118 Bangla text books covering six text genres: romance, detective, war, horror, sci-fi, and history.
1.4 Thesis Contributions

The contributions made by this thesis project are outlined briefly as follows:
• To create a system that can classify Bangla books according to their genre and to perform a performance evaluation on it.

• To create our own Bangla book dataset for genre classification, as no other dataset for this task is available. The created dataset is unique in nature. We have gathered 118 books from various online sources and built a dataset containing information about their authors, genres, and other characteristics, which will be made publicly available in order to aid future study and the improvement of Bangla text categorization in the field of Natural Language Processing.

• To visually represent certain analytical studies of the textual data from the books in order to improve understanding and get more effective outcomes from the data.
1.5 Thesis Outline

In Chapter 2, we discuss some of the related works that have been done by other researchers. How we collected the data, how it was pre-processed, and which classifiers we used are described in Chapter 3. We analyze our findings in Chapter 4. Finally, we draw a conclusion about our work in Chapter 5.
2 Literature Review
Understanding natural language in text or any other form might be an easy task for human beings; however, scrutinising the structure of a language, describing the underlying concepts, and applying these intricacies to address task-specific solutions is a tricky endeavour for computers. Nonetheless, diverse methods have been incorporated and have achieved good outcomes on miscellaneous language processing tasks over time, document categorisation being one of them. Document categorisation refers to the process of classifying unlabelled documents into one or more predefined classes depending on their contents. In this chapter, the literature review is described based on various aspects such as language, data source, algorithms, and results.
2.1 Related Works

Since machine learning has taken off, people from all over the world have used it for a variety of purposes and have developed algorithms for it based on their applications and research publications. Numerous experts from Bangladesh and other countries have engaged with various problems related to the categorization of Bengali texts. While doing research, we were unable to locate any articles on the classification of Bangla literature genres. We thus researched related fields including text categorization, natural language processing, and literary classification.
2.1.1 In Other Languages

Reference [9] worked on a dataset of books that were translated from Hindi and Gujarati to English and predicted the genre of the books. The text was processed using a few cleaning techniques, in particular punctuation removal, digit removal, and tokenization. For feature extraction the author used TfidfVectorizer, and finally applied the supervised learning algorithms K-Nearest Neighbor, Support Vector Machine, and Logistic Regression. The Support Vector Machine gave the best accuracy among the models, at 54.54%.
In Reference [10], automatic classification of news articles from the regional newspaper La Capital of Rosario, Argentina, was done using machine learning techniques. The corpus is a collection of about 75,000 manually classified articles written in Spanish and released in 1991. In order to demonstrate the corpus's qualities, they benchmarked on LCC using three popular supervised learning techniques: k-Nearest Neighbors, Naive Bayes, and Artificial Neural Networks. Naive Bayes outperformed the other algorithms.
A study was conducted on Urdu text classification [11] with 16,678 documents and 13 classes. The authors used different machine learning algorithms such as SVM, Decision Tree Classifier, and K-Nearest Neighbours Classifier. SVM was the best performer, with an accuracy of 68.73%.
In article [12], the author used two datasets to run different machine learning and deep learning models. Dataset one has 1,425 documents and dataset two has 169,791 documents. For text representation, two techniques, Bag of Words and TF-IDF, were used to convert string features into numerical features for performing mathematical operations. In the end, the author compared different algorithms on both datasets and discussed the results. A Neural Network gave the highest accuracy on both datasets: 92.63% on dataset one and 95.50% on dataset two. Among the machine learning algorithms, Naive Bayes performed best on dataset one with 91.23%, and SVM performed best on dataset two with 94.99%. Here we can see that a larger dataset yields higher accuracy, as there is more data to train on.
2.1.2 In Bangla Language

In 2022, a paper [13] presented a method in which the author deep-cleaned the text data before feeding it to the model. The author did extensive data preprocessing before using any models: first removing stopwords, then removing punctuation (since text classification does not need punctuation), then removing unnecessary Unicode characters, and finally applying stemming. The paper proposed a hybrid CNN-LSTM approach that achieved 88.56% accuracy.
In another article [14] on Bangla text document categorization, the authors used a dataset consisting of newspaper articles divided into twelve classes. Duplicate articles were found and removed. The dataset had to contain only Bangla, so punctuation, English words, and digits were removed using Unicode values. The bnlp toolkit was used for stop-word removal. The data was tokenized, and the class labels were mapped to numeric values manually. The dataset was split into 80% for training and 20% for testing using stratification. Tokenized words were used as features, and TF-IDF was used for feature selection. Finally, Random Forest, Multinomial NB, Logistic Regression, MLP, XGBoost, SVM, and LSTM were applied, where LSTM gave the best result: 87% accuracy and F1-score. In the LSTM, the authors used early stopping to reduce overfitting; after four epochs the LSTM model gave the desired result.
For the classification of Bengali news, a study team from Shahjalal University of Science & Technology employed a variety of machine learning-based methods, including baseline and deep learning models [15]. They employed foundational models including Naive Bayes, Logistic Regression, Random Forest, Linear SVM, and CNN, as well as deep learning models like BiLSTM. They found that the Support Vector Machine, used as the base model, and CNN, used for deep learning, produced the best results for their dataset, with 91% and 93.43% accuracy respectively.
2.2 Challenges
The rigorous process of compiling and enhancing the dataset was the most difficult part of this undertaking. Stop words, English text, numbers, Unicode artifacts, and punctuation had to be eliminated as part of a sequence of rigorous stages. Because multiple models had to be trained, both the machine learning and deep learning models' training phases turned out to be very time-consuming. As a result, given the size of the dataset, getting the final findings from all of these models required a great deal of patience. We began this project from the beginning, including activities like data gathering, data cleaning, data pre-processing, and the development of all five models. Our drive was the primary force behind the entire procedure.
3 Methodology
In this chapter, we will guide you through the distinct steps of our methodology for categorizing Bangla books. This process involves several crucial stages, including data collection, data cleaning, data pre-processing, feature extraction, classification, and evaluation. These steps are essential to ensuring the accuracy and effectiveness of our approach. Throughout this framework, we have incorporated supporting elements such as equations, diagrams, figures, and tables to enhance your understanding of the process. This chapter provides details about our proposed method and other experiments conducted in this study.
3.1 Data Collection
To train and evaluate models on Bangla books, the books [16][17][18][19] were collected from online sources like GitHub and Google Drive, where readers have shared books for personal use. The collected books were in EPUB format, which can easily be converted to text. After the EPUBs were collected, we started to create the dataset by converting the EPUBs to text format. Each book was then labeled manually with its author name, author gender, book title, genre, publish date, and book text using our own web application [20], created for data labeling, which outputs data as JSON. The genres that we selected are romance, detective, war, horror, sci-fi, and history. To label the book dataset, we collected all the information about the books from Rokomari [21] and GoodReads [22]. This process resulted in a new dataset of labeled Bangla books that can be used to train and benchmark Bangla book genre classification. This phase took the most time, as the books and the book information were not widely available.
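The thesis does not name a specific conversion tool; as an illustration only, the EPUB-to-text step can be sketched with the ebooklib and BeautifulSoup libraries (both are assumptions, not necessarily the tools used here):

    import ebooklib
    from ebooklib import epub
    from bs4 import BeautifulSoup

    def epub_to_text(path):
        # Read the EPUB and concatenate the plain text of every document item.
        book = epub.read_epub(path)
        chapters = []
        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
            soup = BeautifulSoup(item.get_content(), "html.parser")
            chapters.append(soup.get_text())
        return "\n".join(chapters)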
3.2 Data Insight
This dataset has a total of six attributes: author name, author gender, book title, genre, publish date, and book text. The dataset, which consists of 118 books across six genres (romance, detective, war, horror, sci-fi, and history), will be used in our experiments with different algorithms. Table (3.1) shows the number of books collected per genre, Figure (3.3) shows the percentage of data genre-wise, and Table (3.2) shows the author distribution of the dataset.
Table 3.1: Number of books per genre

Genre       Books
Romance     30
Detective   28
War         21
Sci-fi      15
Horror      14
History     10
Total       118
Table 3.2: Authorship Distribution
3.3 Data Pre-Processing
We collected around 118 books, which inevitably brings storage and memory challenges during preprocessing, as the text data comes from whole books. To enhance efficiency, we employed the pandas library to convert the JSON data from the labeling application into a more manageable CSV format. Given the limitations of a small dataset, we opted to segment large books into smaller chunks based on word count [23]. This approach enables classifiers to concentrate on more manageable and contextually relevant sections, leading to potential performance enhancements. To determine the optimal chunk size, we conducted Logistic Regression experiments using various dataset versions, each with a distinct chunk size, ranging from 10,000 to 50,000 words. Remarkably, the dataset with a chunk size of 20,000 words consistently delivered the most promising results. Consequently, we concluded that partitioning the dataset into 20,000-word segments was the most favorable strategy. With this initial conversion complete (see Table 3.2 for the dataset after chunking), we delved into the core data processing phase. Given our focus on training, English words, punctuation, special characters, emojis, extra white space, and digits in both English and Bangla had to be removed from the dataset to make it ready for training and testing. In our case, there is a lot of text data, as we are working on books. We also eliminated stopwords [24], a set of common Bangla words that hold no significant value for our genre classification pursuit. This removal of stopwords serves a dual purpose: it reduces noise within the data and diminishes dimensionality, thereby speeding up processing and ultimately amplifying the model's performance.
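A minimal sketch of this stage follows; the tiny stopword set and the exact regular expressions are illustrative assumptions (the published Bangla stopword list [24] is far longer):

    import re

    # Illustrative subset only; the real list [24] contains many more words.
    STOPWORDS = {"এবং", "কিন্তু", "আমি", "তুমি", "সে"}

    def clean_text(text):
        # Drop English letters, English and Bangla digits, and any other
        # non-Bangla symbol, then remove stopwords and collapse whitespace.
        text = re.sub(r"[A-Za-z]+", " ", text)
        text = re.sub(r"[0-9\u09E6-\u09EF]+", " ", text)
        text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)
        return " ".join(w for w in text.split() if w not in STOPWORDS)

    def chunk_words(text, size=20000):
        # Split one book into 20,000-word segments, the chunk size that
        # performed best in our Logistic Regression experiments.
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]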
3.4 Feature Extraction
3.4.1 Term Frequency - Inverse Document Frequency (TF-IDF)

As machine learning models only take numerical data as input, in this feature extraction step we used the Term Frequency-Inverse Document Frequency (TF-IDF) technique to further refine and represent the text data from the books in a manner useful to machine learning tasks. TF-IDF is often chosen as a tool for feature extraction in a variety of Natural Language Processing (NLP) and text mining tasks [25] [26]. The process includes tokenizing the text data, creating a document-term matrix, and calculating TF-IDF scores for each term. This approach captures the importance of terms within individual documents while considering their rarity across the entire dataset: TF-IDF scores emphasize significant terms that are common within a document yet rare across the dataset. By applying TF-IDF, the text data is converted into a structured numerical representation that retains key information for genre classification. This representation reduces dimensionality, making it suitable for training machine learning models to accurately classify book genres based on learned patterns and relationships within the data.
w_i = \frac{TF_i \times \log(N / n_i)}{\sqrt{\sum_{i=1}^{n} \left( TF_i \times \log(N / n_i) \right)^2}} \qquad (3.1)
3.4.2 Bag of Words (BoW)

A Bag of Words (BoW) is a simple and fundamental technique in natural language processing (NLP) for text analysis and document classification. It represents a text document as a collection of individual words or tokens, ignoring grammar and word order. The resulting "bag" is essentially a frequency distribution of words, where each word in the document is treated as a separate entity and its occurrence is counted [27].
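Both representations are available off the shelf in scikit-learn; a minimal sketch (the corpus contents are placeholders for our cleaned 20,000-word chunks):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ["... cleaned chunk 1 ...", "... cleaned chunk 2 ..."]

    bow_vectorizer = CountVectorizer()        # Bag of Words: raw term counts
    X_bow = bow_vectorizer.fit_transform(corpus)

    tfidf_vectorizer = TfidfVectorizer()      # counts reweighted by document rarity
    X_tfidf = tfidf_vectorizer.fit_transform(corpus)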
3.5 Classification
3.5.1 Naïve Bayes Multinomial

Naive Bayes is a classification method based on Bayes's theorem. It assumes that the predictors are independent; in simpler terms, it treats each predictor as if it contributed independently to the outcome. This classifier is particularly useful for large datasets and often performs well. Bayes's theorem is a probability concept that deals with conditional probabilities. It helps us calculate the probability of one event happening given that another event has already occurred; this is called a conditional probability. By using past data, we can use conditional probability to calculate the likelihood of an event occurring.
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \qquad (3.2)
Here,

• P(A) represents the initial probability of hypothesis A being true. This is called the prior probability.

• P(A|B) indicates the probability of the hypothesis being true, given the evidence.

• P(B|A) shows the probability of the evidence being true, given the hypothesis.
In our case, Naive Bayes Multinomial (NBM) is a specialized iteration of the Naive
Bayes algorithm tailored for text classification and tasks within natural language pro-
cessing. Naive Bayes Multinomial (NBM) is reviewed comprehensively by several re-
searchers in terms of text classification tasks [28]. Its efficacy shines when handling
discrete data, particularly word counts in documents. The "multinomial" aspect pertains
to its suitability for scenarios involving multiple distinct classes. NBM presents itself
as a straightforward implementation option, yielding noteworthy performance across a
spectrum of text classification tasks, particularly when working with datasets of moderate
scale.
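A minimal scikit-learn sketch of this classifier on our TF-IDF features (the 80/20 stratified split and the random seed are illustrative assumptions; labels is the per-chunk genre list):

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # X_tfidf comes from the feature-extraction sketch above.
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, labels, test_size=0.2, stratify=labels, random_state=42)

    nb = MultinomialNB()              # default Laplace smoothing (alpha=1.0)
    nb.fit(X_train, y_train)
    print(nb.score(X_test, y_test))   # mean accuracy on the held-out chunks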
3.5.2 Support Vector Machine

The Support Vector Machine algorithm is well suited to text classification tasks. A Support Vector Machine (SVM) is a classifier that distinguishes between data classes by a separating hyperplane. In two-dimensional space, the hyperplane is a line that divides the plane into two halves, with each class on its own side.
Figure 3.6: Support Vector Machine (SVM)
In the SVM (Support Vector Machine) algorithm, our objective is to maximize the separation (margin) between the data points and the hyperplane. Several researchers have thoroughly studied the Support Vector Machine (SVM) for text classification tasks [29]. The loss function responsible for increasing this separation is known as the hinge loss.
The cost is zero when the predicted value and the actual value share the same sign. However, if they have opposite signs, the algorithm calculates the loss by determining the extent of this mismatch and incorporating a regularization term into the cost function. This regularization term is introduced to balance the trade-off between maximizing the separation margin and minimizing the loss. With the regularization term added, the cost takes the standard regularized hinge-loss form

c(x, y, f(x)) = \begin{cases} 0 & \text{if } y \cdot f(x) \geq 1 \\ 1 - y \cdot f(x) & \text{otherwise} \end{cases} \qquad (3.3)

\min_{w} \; \lambda \|w\|^2 + \sum_{i=1}^{n} \left( 1 - y_i \langle x_i, w \rangle \right)_{+} \qquad (3.4)
Taking partial derivatives with respect to the weights determines the directions in which the weights should be updated. In the event of a misclassification, the gradient update incorporates the loss along with the regularization term:

w = w + \alpha \cdot \left( y_i \cdot x_i - 2 \lambda w \right) \qquad (3.5)
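As a compact sketch of this training loop (the learning rate, regularization strength, and epoch count are illustrative assumptions; labels are encoded as -1/+1):

    import numpy as np

    def svm_sgd(X, y, alpha=1e-4, lam=0.01, epochs=100):
        # Stochastic gradient descent on the regularized hinge loss; the
        # misclassification branch applies the update from Eq. (3.5).
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * np.dot(xi, w) < 1:              # inside margin or wrong side
                    w = w + alpha * (yi * xi - 2 * lam * w)
                else:                                    # correct side: shrink w only
                    w = w + alpha * (-2 * lam * w)
        return w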
3.5.3 Logistic Regression
Logistic Regression stands out as a prominent classification algorithm renowned for its effectiveness in handling categorical data. This algorithm proves particularly valuable when dealing with scenarios involving binary outcomes. The essence of Logistic Regression lies in predicting the probability of a specific outcome, denoted as P(Y = 1), based on the input data, X. The resulting prediction resides within the range of 0 to 1 and is facilitated by the construction of a logistic curve [30]. Logistic Regression finds extensive utility in various data modeling tasks, such as identifying spam, analyzing movie reviews as positive or negative, and detecting tumor malignancy. It occupies a significant position within a category of models known as generalized linear models. While Logistic Regression and linear regression share resemblances, they diverge in their curve construction. In Logistic Regression, the curve is shaped by the natural logarithm of the odds of the target variable. Distinct variations of Logistic Regression cater to different scenarios. In the case of binomial or binary logistic regression, only two possible outcomes exist. Multinomial logistic regression accommodates situations where three or more non-ordered categories are feasible. On the other hand, ordinal logistic regression finds its niche in handling ordered dependent variables.
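For reference, the logistic curve described here has the standard sigmoid form (a textbook restatement, not an equation numbered in this thesis):

P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta^{T} X)}}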
3.5.4 Multi-layer Perceptron (MLP)
The Multilayer Perceptron (MLP) stands as a widely embraced and prevalent neural network model in the realm of deep learning [31]. This model comprises three distinct layer types. Firstly, the input layer takes in all pertinent inputs. On the flip side, the output layer constitutes the final stage of the network. In our specific scenario, the output layer encompasses 6 nodes to accommodate the 6 available classes. Between the input and output layers lies the hidden layer, a sequential network of perceptrons. Conceivably, multiple hidden layers may exist within an MLP, although we have opted for a single hidden layer of 1000 nodes in our case. Our approach involves a maximum of 200 iterations. Within the fully connected network of perceptrons, each node emerges as a linear combination of weighted contributions from nodes in the preceding layers. These nodes subsequently undergo an activation function. Stated differently, the value of each node corresponds to the summation of products between its connected values and associated weights. This procedural sequence ensures the computation of every layer, culminating in the final output layer and the realization of desired outcomes. For an illustrative depiction of the MLP architecture, refer to Figure (3.8).
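These settings map directly onto scikit-learn's MLPClassifier; a minimal sketch (enabling early stopping here mirrors the experimental setup described in Chapter 4):

    from sklearn.neural_network import MLPClassifier

    # One hidden layer of 1000 nodes, at most 200 iterations, as described above.
    mlp = MLPClassifier(hidden_layer_sizes=(1000,), max_iter=200,
                        early_stopping=True, random_state=42)
    mlp.fit(X_train, y_train)
    print(mlp.score(X_test, y_test))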
3.5.5 CNN
Convolutional Neural Networks (CNNs) use perceptrons as building blocks for analyzing data. These networks utilize three-dimensional layers in which only a subset of neurons maintains connections with the preceding layer. The structure of CNNs comprises various layers, including convolutional layers (kernels), pooling layers, rectified linear unit (ReLU) layers, and fully connected layers. The output of these layers is then fed into the neurons of a neural network. The inspiration for deep convolutional neural networks comes from their incorporation of local connections between layers and their capacity to achieve spatial invariance. A common practice in CNN architecture involves interleaving pooling layers with successive convolutional layers (kernels) to mitigate overfitting by reducing the number of parameters and computational demands. CNNs excel in extracting pertinent information from input data, and by reshaping this input data, they prepare it for further application of methods like Artificial Neural Networks.
3.6 Evaluation Metrics

3.6.1 Accuracy

Accuracy is the most intuitive metric: the proportion of correct predictions among all predictions made,

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3.6)

However, it has limitations, particularly when dealing with imbalanced datasets where one class is significantly more prevalent than others. In such cases, a high accuracy might not reflect the true effectiveness of a model, as it could be heavily biased toward the dominant class. Thus, while accuracy provides a quick overview of a model's overall performance, it should be interpreted alongside other metrics, especially when facing complex or imbalanced data distributions.
3.6.2 Confusion Matrix

The confusion matrix serves as a tool to assess the performance of a classification algorithm. It becomes particularly valuable in scenarios where there is an unequal distribution of samples across categories or when dealing with datasets containing more than two groups. Relying solely on classification accuracy can be misleading under these circumstances. Hence, the confusion matrix is utilized to provide a clearer representation of outcomes. Within the confusion matrix, the counts of predicted and actual cases for both positive and negative outcomes are presented. This framework offers insights into the model's behavior and its effectiveness in differentiating between classes.
• True Positives (TP): The cases in which we predicted YES and the actual output was also YES.

• True Negatives (TN): The cases in which we predicted NO and the actual output was NO.

• False Positives (FP): The cases in which we predicted YES and the actual output was NO.

• False Negatives (FN): The cases in which we predicted NO and the actual output was YES.
3.6.3 Precision
Precision in machine learning is a measure that focuses on how accurate a model is when it predicts a positive outcome. It looks at the proportion of correctly predicted positive cases out of all the cases the model predicted as positive. In simple terms, precision helps us understand how good the model is at avoiding false positives, situations where it wrongly predicts something as positive when it's actually not. A high precision indicates that when the model says something is positive, it's usually right.
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \qquad (3.7)
3.6.4 Recall
Recall in machine learning is a measure that focuses on capturing all the relevant positive cases. It calculates the proportion of correctly predicted positive cases out of all the actual positive cases in the dataset. In simpler terms, recall helps us understand how good the model is at finding all the important things it should find.
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \qquad (3.8)
3.6.5 F1-Score
The F1 score in machine learning combines the precision and recall metrics to provide a fair assessment of a model's performance. This is especially helpful when trying to strike a balance between reducing false positives and false negatives. The F1 score is constructed as the harmonic mean of recall and precision, which gives greater weight to the lower of the two values. In other words, the F1 score penalizes models that significantly favor one metric over the other and rewards those that achieve both high precision and high recall. It is like finding the ideal balance between being cautious not to make too many errors and still ensuring that crucial items are not missed. When assessing a model's overall performance, particularly in circumstances where precision and recall must be balanced, such as text classification or spam detection, the F1 score is useful.
F_1\text{-}\mathrm{Score} = 2 \times \frac{\text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \qquad (3.9)
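All of these metrics, reported per genre, are available from scikit-learn in a few calls; a minimal sketch against any of the fitted models above (here the Naive Bayes model from Section 3.5.1):

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y_pred = nb.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))       # rows: actual, columns: predicted
    print(classification_report(y_test, y_pred))  # precision, recall, F1, support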
4 Experimental Evaluation
In this chapter, we present a comprehensive overview of the output and behavioral processes utilized in the Bangla Book Genre Classification across all models. The evaluation approach encompasses several crucial steps, including the creation of confusion matrices and the generation of classification reports. Together, these steps provide insights into the strengths and weaknesses of the models. The confusion matrices visually depict the models' predictions compared to the actual genres, facilitating a clear understanding of their areas of proficiency and limitations. These matrices are a valuable tool for identifying misclassification patterns and highlighting genres that may challenge the models. In addition to the confusion matrices, the classification reports offer a more detailed analysis of the models' performance. They include metrics like precision, recall, F1-score, and support for each genre category, providing a comprehensive breakdown of their capabilities.
4.1 Findings
4.1.1 Logistic Regression

Table (4.1) and Table (4.2) present the classification report for our Logistic Regression analysis. Our experimental results yielded an accuracy of 85% for this model with Bag of Words (BoW) and 89% with TF-IDF, a notable achievement in its performance; TF-IDF gives a 4.7% relative improvement in accuracy over BoW.
Table 4.1: Logistic Regression Classification Report (BoW)
Figure 4.1: Confusion Matrix of Logistic Regression - BoW (left) and TF-IDF (right)
4.1.2 Support Vector Machine
Table (4.3) and Table (4.4) present the classification report for our Support Vector Machine. Our experimental results yielded an accuracy of 83% for this model with BoW and 94% with TF-IDF, a notable achievement in its performance; TF-IDF gives a 13.25% relative improvement in accuracy over BoW in this model. The optimal set of parameters that we used was 'C': 10, 'degree': 2, 'gamma': 0.1, 'kernel': 'linear'.
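The thesis does not state how these parameters were found; a cross-validated grid search is one plausible route, sketched below with an assumed candidate grid. Note that with a linear kernel, 'degree' and 'gamma' have no effect and simply echo the searched grid:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {"C": [0.1, 1, 10], "degree": [2, 3],
                  "gamma": [0.01, 0.1, 1], "kernel": ["linear", "poly", "rbf"]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)  # e.g. {'C': 10, 'degree': 2, 'gamma': 0.1, 'kernel': 'linear'}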
Figure 4.2: Confusion Matrix of Support Vector Machine - BoW (left) and TF-IDF (right)
4.1.3 Naïve Bayes Multinomial

Table 4.6: Naïve Bayes Multinomial Classification Report (TF-IDF)
Figure 4.3: Confusion Matrix of Naïve Bayes Multinomial - BoW (left) and
TF-IDF (right)
4.1.4 MLP
Here, we used a hidden layer of 1000 nodes and early stopping so that the model does not overfit. Table (4.7) and Table (4.8) present the classification report for our MLP. Our experimental results yielded an accuracy of 88% for this model with BoW and 93% with TF-IDF, a notable achievement in its performance; TF-IDF gives a 5.68% relative improvement in accuracy over BoW in this model.
Table 4.7: MLP Classification Report (BoW)
4.1.5 CNN
Early Stopping, implemented through the callbacks API, was employed in our study to enhance the training process of our Convolutional Neural Network (CNN) model. Our chosen metric for monitoring was Sparse Categorical Accuracy. Specifically, we aimed to halt the training procedure if no discernible improvement in validation accuracy was observed for a consecutive span of five epochs. This strategy was implemented to prevent overfitting and optimize the model's generalization capabilities. As depicted in Figure (4.5), the graph illustrates the trend in accuracy (left) and the corresponding loss values (right) throughout the training process using BoW. Likewise, Figure (4.6) illustrates the trend in accuracy (left) and the corresponding loss values (right) throughout the training process using TF-IDF. It is important to note that the weights utilized for the model were those retrieved from the epoch that exhibited the highest validation accuracy. This approach ensures that the model is equipped with the most optimal parameter values, thereby potentially improving its performance on unseen data. This technique aligns with our overarching goal of training an efficient and effective CNN model for the task at hand.
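The callbacks API described here is Keras's; a minimal sketch of the early-stopping setup follows. The network body shown is a generic one-dimensional CNN over dense feature vectors, an illustrative assumption, since the thesis does not list its exact layers:

    import tensorflow as tf

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_sparse_categorical_accuracy",  # the metric monitored above
        patience=5,                                  # five epochs with no improvement
        restore_best_weights=True)                   # keep the best-epoch weights

    input_dim = X_train.shape[1]                     # BoW or TF-IDF vector length
    model = tf.keras.Sequential([
        tf.keras.layers.Reshape((input_dim, 1), input_shape=(input_dim,)),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(6, activation="softmax"),  # six genres
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["sparse_categorical_accuracy"])
    # y_train holds integer genre ids (0-5); sparse matrices are densified first.
    model.fit(X_train.toarray(), y_train, validation_split=0.1,
              epochs=50, callbacks=[early_stop])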
Figure 4.5: CNN Model Accuracy (left) and Loss (right) using BoW
Figure 4.6: CNN Model Accuracy (left) and Loss (right) using TF-IDF
Table (4.9) and Table (4.10) display the CNN classification report, offering an overview of the model's classification performance. Additionally, Figure (4.7) presents the CNN confusion matrix, visually summarizing the alignment between predicted and actual class labels with both text representation techniques, BoW (left) and TF-IDF (right).
Table 4.10: CNN Classification Report (TF-IDF)
4.1.6 Result Summary

In summary, SVM and Naive Bayes Multinomial with TF-IDF feature extraction gave the best result among all the algorithms, with 0.94 accuracy, while CNN with BoW feature extraction gave the lowest accuracy, 0.81. The performance comparison between our classifiers is shown below. Figures (4.8) and (4.9) show the performance comparison of the different machine learning and neural network techniques, and Table (4.11) gives a summary of all the models we ran with the BoW and TF-IDF feature extraction techniques.
Figure 4.8: Model Evaluation of Different Models (BoW)
Table 4.11: Performance comparison
4.2 Comparing with Similar Papers

As we couldn't find any dataset on Bangla book genre classification, we decided to compare our results with similar text classification works. Table (4.12) shows the comparison of our work with some similar works done previously. By comparing with these previous works, we can see that although our dataset is considerably smaller than in other research, our results are promising.
5 Conclusion
This chapter offers the thesis's final remarks, a list of its shortcomings, and information on where it will go from here. The thesis's conclusion is stated in Section 5.1, its limitations are explored in Section 5.2, and its future path is defined in Section 5.3.
5.1 Conclusion
In this thesis, we give a thorough analysis of the categorization of Bangla book genres using a variety of machine learning and neural network-based techniques. We used the machine learning-based Logistic Regression, Support Vector Machine (SVM), and Naïve Bayes Multinomial, as well as the neural network-based Multilayer Perceptron (MLP) and Convolutional Neural Network (CNN). SVM and Naïve Bayes Multinomial gave the best result among all the algorithms, with 94% accuracy. All of these models can categorize the six types of genre. We built an extensive system that can categorize the textual input data of Bangla books and provide a prediction of genre.
5.2 Limitations
The main limitation of this work is the amount of data. Because of the small dataset, the horror genre cannot be predicted accurately: horror books get incorrectly classified, since books of other genres can have similar characteristics. More horror books could solve this problem. Additionally, a book can have characteristics of multiple genres, so implementing multi-label categorization could address this limitation.
5.3 Future Work
The performance and application of the system may yet be improved, despite the fact that our study has made considerable strides in Bangla book genre classification. Future study might focus on a number of issues, including:

• Adding more books to our dataset. This expanded dataset would provide a more comprehensive and robust foundation for analysis, research, and machine learning models.

• Adding Bangla stemming and lemmatization. Due to the shortage of good-quality Bangla stemming and lemmatization libraries, we did not use one.

• Using other feature selection methods that have been applied in other languages but that we did not use.
References
[2] S. Tong and D. Koller, “Support vector machine active learning with applications
to text classification,” Journal of machine learning research, vol. 2, no. Nov, pp.
45–66, 2001.
[3] D. D. Lewis and M. Ringuette, “A comparison of two learning algorithms for text
categorization,” in Third annual symposium on document analysis and information
retrieval, vol. 33, 1994, pp. 81–93.
[4] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint
arXiv:1408.5882, 2014.
[7] A. I. Kadhim, “Survey on supervised machine learning techniques for automatic text
classification,” Artificial Intelligence Review, vol. 52, no. 1, pp. 273–292, 2019.
[8] G. Kou, P. Yang, Y. Peng, F. Xiao, Y. Chen, and F. E. Alsaadi, “Evaluation of feature
selection methods for text classification with small datasets using multiple criteria
decision-making methods,” Applied Soft Computing, vol. 86, p. 105836, 2020.
[9] B. Y. Panchal, “Book genre categorization using machine learning algorithms (k-
nearest neighbor, support vector machine and logistic regression) using customized
dataset,” March 2021, available at SSRN: https://ssrn.com/abstract=3805945.
[Online]. Available: https://ssrn.com/abstract=3805945
[11] I. Rasheed, V. Gupta, H. Banka, and C. Kumar, “Urdu text classification: A com-
parative study using machine learning techniques,” in 2018 Thirteenth International
Conference on Digital Information Management (ICDIM), 2018, pp. 274–278.
[15] M. Hossain, S. Sarkar, and M. Rahman, “Different machine learning based ap-
proaches of baseline and deep learning models for bengali news categorization,”
International Journal of Computer Applications, vol. 176, pp. 10–16, 04 2020.
[19] Pinu’s kindle. [Online]. Available:
https://drive.google.com/drive/u/0/folders/1FAukxq7IzhUgqKA9VRxShObgN ipV 2G
[21] (2023) Rokomari book store. Accessed on June 01, 2023. [Online]. Available:
https://www.rokomari.com/book
[23] S. Todeschini. How to chunk text data: A comparative analysis. [Online]. Available: https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a
[25] G. Forman and I. Cohen, “Learning from little: Comparison of classifiers given little training,” in Knowledge Discovery in Databases: PKDD 2004, J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 161–172.
[26] K. Masuda, T. Matsuzaki, and J. Tsujii, “Semantic search based on the online
integration of nlp techniques,” Procedia - Social and Behavioral Sciences, vol. 27,
pp. 281–290, 2011, computational Linguistics and Related Fields. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1877042811024360
[28] P. Bolaj and S. Govilkar, “Text classification for marathi documents using supervised
learning methods,” Int. J. Comput. Appl, vol. 155, no. 8, pp. 6–10, 2016.
[29] Z.-q. Wang, X. Sun, D.-x. Zhang, and X. Li, “An optimal SVM-based text classification algorithm,” in 2006 International Conference on Machine Learning and Cybernetics, 2006, pp. 1378–1381.
[32] Z. Wang and Z. Qu, “Research on web text classification algorithm based on improved cnn
and svm,” in 2017 IEEE 17th International Conference on Communication Technology
(ICCT), 2017, pp. 1958–1961.