
Bangla Book Genre Classification Using Different Machine Learning and Neural Network Techniques

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Bachelor in Computer Science & Engineering
by
Ahmed Shafi Arnob
CSE 06907983

Supervised by: Tamanna Haque Nipa


Assistant Professor

Department of Computer Science and Engineering


STAMFORD UNIVERSITY BANGLADESH
September 2023
Abstract

Artificial intelligence has become the newest data science powerhouse of the modern
era. Since its inception, the use of Machine Learning, Deep Learning, and Computer Vision
algorithms in data analytics has gained popularity. However, using Logistic Regression,
Support Vector Machine, Naive Bayes Multinomial, Multilayer Perceptron, and Convolutional
Neural Network models to classify genre, and examining the performance of these models on
Bangla book text, has not yet been investigated. Hence, in this work we propose several
machine learning and neural network based models to identify six genre categories in
Bengali book data. We evaluated the overall accuracy of the Logistic Regression, SVM,
Naive Bayes Multinomial, MLP, and CNN models. Of all of them, SVM and Naive Bayes
Multinomial attained relatively high accuracy.

Declaration

I hereby declare that the work presented in this Thesis is the outcome of an investigation
performed by me under the supervision of Tamanna Haque Nipa, Assistant Professor,
Department of Computer Science & Engineering, Stamford University Bangladesh. I also
declare that no part of this Thesis has been or is being submitted elsewhere for the award
of any degree or diploma.

Signature and Date:

...........................................
Ahmed Shafi Arnob
Date:

Supervisor’s Signature and Date:

...........................................
Tamanna Haque Nipa

Date:

Dedicated to ...
I would like to dedicate this thesis to my beloved parents and teachers.
Acknowledgments

I would like to begin by expressing my gratitude to my thesis supervisor, Tamanna Haque
Nipa, Assistant Professor, Department of Computer Science and Engineering, Stamford
University Bangladesh. Throughout the research and writing process, she consistently
provided valuable guidance and support. Her office was always open to address any queries
or challenges I encountered. I sincerely appreciate her mentorship and the way she allowed
me to take ownership of this work while guiding me in the right direction. Throughout this
thesis, she was a never-ending source of inspiration, motivation, and encouragement.
Furthermore, I am deeply grateful to the faculty, friends, and family members associated
with Stamford University Bangladesh. Their influence, support, and encouragement have
played a vital role in my journey. I acknowledge that without their appreciation,
guidance, and assistance, this accomplishment would not have been possible. They have
been a constant source of inspiration and motivation throughout this thesis endeavor. In
conclusion, I sincerely thank everyone who has contributed to this research and thesis,
ensuring its successful completion.

Table of Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Figures vii

List of Tables viii

1: Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Aim of study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2: Literature Review 5
2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 In Other Languages . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 In Bangla Language . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3: Methodology 8
3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Data Insight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4.1 Term Frequency - Inverse Document Frequency (TF-IDF) . . . . . 13
3.4.2 Bag of Words (BoW) . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5.1 Naïve Bayes Multinomial . . . . . . . . . . . . . . . . . . . . . . 13
3.5.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . 14
3.5.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5.4 Multi-layer Perceptron (MLP) . . . . . . . . . . . . . . . . . . . 17
3.5.5 CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.6 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6.2 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6.5 F1-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4: Experimental Evaluation 21
4.1 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . 23
4.1.3 Naïve Bayes Multinomial . . . . . . . . . . . . . . . . . . . . . . 24
4.1.4 MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.5 CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.6 Result Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Comparing with Similar Papers . . . . . . . . . . . . . . . . . . . . . . . 31

5: Conclusion 32
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

References 34

List of Figures

3.1 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


3.2 Data Collection Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 First few rows of dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Percentage of data by genre . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Data Pre-Processing Steps . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Comparison Between Linear model and Logistic model . . . . . . . . . . 16
3.8 MLP architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1 Confusion Matrix of Logistic Regression - BoW (left) and TF-IDF(right) . 22


4.2 Confusion Matrix of Support Vector Machine - BoW (left) and TF-IDF(right) 24
4.3 Confusion Matrix of Naïve Bayes Multinomial - BoW (left) and TF-IDF
(right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 Confusion Matrix of MLP - BoW (left) and TF-IDF(right) . . . . . . . . 26
4.5 CNN Model Accuracy (left) and Loss (right) using BoW . . . . . . . . . 27
4.6 CNN Model Accuracy (left) and Loss (right) using TF-IDF . . . . . . . . 28
4.7 Confusion Matrix of CNN - BoW (left) and TF-IDF(right) . . . . . . . . 29
4.8 Model Evaluation of Different Models (BoW) . . . . . . . . . . . . . . . 30
4.9 Model Evaluation of Different Models (TF-IDF) . . . . . . . . . . . . . 30

List of Tables

3.1 Bangla Book Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10


3.2 Authorship Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Logistic Regression Classification Report (BoW) . . . . . . . . . . . . . 22


4.2 Logistic Regression Classification Report (TF-IDF) . . . . . . . . . . . . 22
4.3 Support Vector Machine Classification Report (BoW) . . . . . . . . . . . 23
4.4 Support Vector Machine Classification Report (TF-IDF) . . . . . . . . . . 23
4.5 Naïve Bayes Multinomial Classification Report (BoW) . . . . . . . . . . 24
4.6 Naïve Bayes Multinomial Classification Report (TF-IDF) . . . . . . . . . 25
4.7 MLP Classification Report (BoW) . . . . . . . . . . . . . . . . . . . . . 26
4.8 MLP Classification Report (TF-IDF) . . . . . . . . . . . . . . . . . . . . 26
4.9 CNN Classification Report (BoW) . . . . . . . . . . . . . . . . . . . . . 28
4.10 CNN Classification Report (TF-IDF) . . . . . . . . . . . . . . . . . . . 29
4.11 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.12 Comparing Our Work with Previous Work (Accuracy) . . . . . . . . . . 31

1 Introduction

As the world becomes increasingly reliant on technology, the fusion of artificial and
human intelligence has pushed the boundaries of cognitive capabilities. This synergy has
given rise to Artificial Intelligence (AI), a concept that has gained immense popularity
globally. Within the realm of AI, Natural Language Processing (NLP) holds particular
prominence, especially in the field of text classification research. NLP serves as a
prominent method for searching, analyzing, comprehending, and deriving insights from
text-based data.

The diversity of human languages further complicates matters, with individuals expressing
themselves in their native tongues such as Spanish, Bangla, Chinese, and English. However,
it falls upon computers to dissect and interpret the meanings within these texts. NLP
empowers computers to extract meaningful information with greater sophistication. Over
recent years, the prevalence of NLP has surged, and the technique finds application in
various domains including text classification, information extraction, speech tagging,
and more.

The number of people using smartphones and other digital devices is rising quickly, and
so is the number of people using the internet [1]. As a result, consumption of digital
content has increased substantially in recent years, and reading books is no exception.
The days when people read only printed books are long past: nowadays people read books on
their smartphones or on eReaders like the Amazon Kindle, and books from all languages,
Bangla included, are read on these devices. Due to the growth of Bangla book reading on
digital devices, publishers now publish their books digitally so that readers can easily
read them on their favourite devices. A large number of eBooks are published daily. The
extensive and increasing electronic availability of Bangla text documents therefore
heightens the need for automatic methods that analyze the content of those documents and
serve their readers well. If books are categorized according to their appropriate genre,
then searching for and retrieving information becomes quick and efficient. We can also
recommend books to readers based on the genre of the books they are reading. Text
classification has other uses besides book classification, such as email filtering, spam
detection, and sentiment analysis.

In the field of text categorization in languages other than English, there is often a
scarcity of large datasets for training models, which hampers the achievement of superior
results. This is especially true for the Bengali language, where we found a lack of
datasets relevant to Bengali books. As a result, we set about creating our own dataset to
support this research.

Among NLP experts, supervised classification algorithms have been a common choice for
problems involving document classification [2][3]. Furthermore, recent work on text
classification using neural networks (NNs) with word vectors has produced outstanding
results [4]. In this study, we conduct a thorough examination of the difficulties
associated with classifying textual data in Bengali. Our dataset has 118 examples from
six different genres. We used Term Frequency-Inverse Document Frequency (TF-IDF) and
Bag of Words (BoW) as feature extraction techniques, aiming to classify Bangla electronic
text documents. We then ran comparison experiments on several models, including Logistic
Regression, Naïve Bayes Multinomial, Support Vector Machine (SVM), Multilayer Perceptron
(MLP), and Convolutional Neural Networks (CNN).

1.1 Problem Statement

The technique of text classification is utilized to provide a broad view of a system.
Among other things, it is quite helpful for filtering emails and for managing Web content
and search engines. Technological development has increased interest in text
categorization problems. On the internet there are many Bangla language documents that
are both valuable and challenging to categorize adequately into their corresponding
semantic categories. If documents are organized into their appropriate categories,
searching and information retrieval are quick and simple. Furthermore, readers prefer to
read, on screen, the books that interest them most, so they are likely to be interested
in receiving books from their preferred categories. Consumers expect customized
selections of books appropriate for them to be prominently displayed on the initial
pages; this type of work is carried out in a variety of worldwide book reading apps.
Reading a whole book to detect its genre is labour intensive and not a productive use of
time. Thus, text categorization is a task that has both commercial and labor-saving
implications. In the text categorization field, extensive research has been performed on
many languages; in Bangla, however, not much research has been done, although the Bangla
language has an extensive heritage and is one of the most spoken languages in the world.
Native speakers of Bangla are approximately 8% of the world population [5], and in terms
of population Bangla is the world's fourth most popular language [6]. Thus, it is
important to automatically arrange and categorize Bengali books so that users may
conveniently find relevant information.

1.2 Motivation

As the volume of textual data has exploded in recent years, classifying documents with
automated techniques has emerged as a crucial job in the field of natural language
processing. Application is a crucial consideration when choosing a method for
automatically classifying Bangla text documents, since it is necessary to distinguish
between text data from many domains across many different areas of expertise, including
consumer goods, law, healthcare, and education, to name just a few. In addition, text
data in many fields is growing as native Bengalis use the internet more often. Beyond
applicability, only a small number of studies have used well-known supervised algorithms
in their selection of algorithms for the task [7]. There is no established comparison of
all the well-known and widely applied classification methods. Additionally, even though
the majority of these studies make use of sizable datasets, there may be fewer training
examples in brand-new real-world initiatives, especially in the Bangla language.
Therefore, performance evaluation on small datasets using current approaches should be
understood to be just as important as evaluation on huge datasets [8].

1.3 Aim of study

A technological revolution is currently taking place. In the digital age we live in
today, technology is omnipresent. We must advance Bengali in the field of Natural
Language Processing (NLP) in order to compete with other languages that are excelling in
this area. We aim to create a dataset of Bangla books and to categorize Bangla book text
into its respective genre using machine learning algorithms and neural networks. This
work gives future researchers room to build on our dataset, and it also provides a quick
overview of the effectiveness of the algorithms we employed on the dataset of Bangla book
texts. To compare which supervised learning approach performs better, we used four
performance metrics: accuracy, precision, recall, and f1-score. Our work reports the
performance of the algorithms applied to the classification of Bangla book text; the
experiments were conducted on 118 Bangla books spanning six genres: romance, detective,
war, horror, sci-fi, and history.

1.4 Thesis Contributions

The contributions made by this thesis project are outlined briefly as follows:

• To create a system that can classify Bangla books according to their genre and to
perform a performance evaluation on it.

• We have created our own Bangla book dataset for genre classification, as no other
dataset for this task is available. The created dataset is unique in nature. We have
gathered 118 books from various online sources and built a dataset containing information
about their authors, genres, and other characteristics, which will be made publicly
available in order to aid future study and the improvement of Bangla text categorization
in the field of Natural Language Processing.

• To visually represent certain analytical studies of the textual data from the books in
order to improve understanding and obtain more effective outcomes from the data.

1.5 Thesis Outline

In chapter 2, we discuss some of the related works that have been done by other
researchers. How we collected the data, how it was pre-processed, and the classifiers we
used are described in chapter 3. We analyze our findings in chapter 4. At the end, we
draw a conclusion about our work in chapter 5.

2 Literature Review

Understanding natural language in text or any other form might be an easy task for human
beings; however, scrutinising the structure of a language, describing the underlying
concepts, and applying these intricacies to address task-specific solutions is a tricky
endeavour for computers. Nonetheless, diverse methods have been incorporated over time
and have achieved good outcomes on miscellaneous language processing tasks, document
categorisation being one of them. Document categorisation refers to the process of
classifying unlabelled documents into one or more predefined classes depending on their
contents. In this section, the literature is reviewed based on various aspects such as
language, data source, algorithms, and results.

2.1 Related Works

Since machine learning has taken off, people from all over the world have used it for a
variety of purposes and have developed algorithms for it based on their applications and
research publications. Numerous experts from Bangladesh and other countries have engaged
with various problems related to the categorization of Bengali texts. While doing this
research, we were unable to locate any articles on the classification of Bangla
literature genres. We therefore researched related fields including text categorization,
natural language processing, and literary classification.

2.1.1 In Other Languages

Reference [9] worked on a dataset of books that were translated from Hindi and Gujarati
to English, and predicted the genre of the books. The text was processed with a few
cleaning techniques, in particular punctuation removal, digit removal, and tokenization.
For feature extraction the author used TfidfVectorizer, and finally applied the
supervised learning algorithms K-Nearest Neighbor, Support Vector Machine, and Logistic
Regression. The Support Vector Machine gave the best accuracy of the three models, at
54.54%.

In Reference [10], automatic classification of news articles from the regional newspaper
La Capital of Rosario, Argentina, was done using machine learning techniques. The corpus
is a collection of about 75,000 manually classified articles written in Spanish and
released in 1991. In order to demonstrate the corpus's qualities, they benchmarked on LCC
using three popular supervised learning techniques: k-Nearest Neighbors, Naive Bayes, and
Artificial Neural Networks. Naive Bayes outperformed the other algorithms.

Research was also conducted on Urdu text classification [11], with 16,678 documents and
13 classes. The authors used different machine learning algorithms such as SVM, Decision
Tree Classifier, and K-Nearest Neighbours Classifier. SVM was the best performer there,
with an accuracy of 68.73%.

2.1.2 In Bangla Language

In article [12] the authors used two datasets to run different Machine Learning and Deep
Learning models. Dataset one has 1,425 documents and dataset two has 169,791 documents.
For text representation, two techniques, Bag of Words and the TF-IDF model, are used to
convert string features into numerical features for performing the mathematical
operations. In the end the authors compared different algorithms on both datasets and
discussed the results. A Neural Network gave the highest accuracy on both datasets:
92.63% on dataset one and 95.50% on dataset two. Among the machine learning algorithms,
Naive Bayes performed best on dataset one with 91.23%, while on dataset two SVM performed
best with 94.99%. Here we can see that a large dataset gives more accuracy, as there is
more data to train on.

In 2022, a paper [13] presented a method where the authors deep-cleaned the text data
before feeding it to the model, doing substantial preprocessing before using any models.
The data was processed by first removing the stopwords, then removing punctuation (which
text classification does not need), then removing unnecessary Unicode characters, and
finally stemming. The paper proposed a hybrid CNN-LSTM approach, which achieved 88.56%
accuracy.

In another article [14] on Bangla text document categorization, the authors used a
dataset consisting of newspaper articles divided into twelve classes. Duplicate articles
were found and removed. The dataset had to contain only Bangla, so punctuation, English
words, and digits were removed using Unicode values. The bnlp toolkit was used for
stop-word removal. The data was tokenized, and the class labels were mapped to numeric
values manually. The dataset was split into 80% for training and 20% for testing using
stratification. Tokenized words were used as features and TF-IDF was used for feature
selection. Finally, the authors applied Random Forest, Multinomial NB, Logistic
Regression, MLP, XGBoost, SVM, and LSTM, where LSTM gave the best result with 87%
accuracy and f1-score. In the LSTM, the authors used early stopping to reduce
overfitting; after four epochs the LSTM model gave the desired result.

For the classification of Bengali news, a study team from Shahjalal University of Science
& Technology employed a variety of machine learning-based methods, including baseline and
deep learning models [15]. They employed foundational models including Naive Bayes,
Logistic Regression, Random Forest, Linear SVM, and CNN, as well as deep learning models
like BiLSTM. They discovered that the Support Vector Machine, used as the base model,
which gave 91% accuracy, and CNN, used for deep learning, which gave 93.43%, produced the
best results for their dataset.

2.2 Challenges

The rigorous process of compiling and enhancing the dataset was the most difficult part
of this undertaking. Stop words, non-Bangla text, numbers, unnecessary Unicode characters,
and punctuation had to be eliminated in a sequence of rigorous stages. Because multiple
models had to be trained, the training phases of both the machine learning and the deep
learning models turned out to be very time-consuming. As a result, given the size of the
dataset, getting the final findings from all of these models required a great deal of
patience. We began this project from the beginning, including activities like data
gathering, data cleaning, data pre-processing, and the development of all five models.
Our drive was the primary force behind the entire procedure.

3 Methodology

In the following section, we will guide you through the distinct steps of our methodology
for categorizing Bangla Books. This process involves several crucial stages, including
data collection, data cleaning, data pre-processing, feature extraction, classification, and
evaluation. These steps are essential to ensuring the accuracy and effectiveness of our
approach. Throughout this framework, we have incorporated supporting elements such
as equations, diagrams, figures, and tables to enhance your understanding of the process.
This chapter provides details about our proposed method and other experiments conducted
in this study.

Figure 3.1: Proposed Methodology

3.1 Data Collection

To train and evaluate on Bangla books, the books [16][17][18][19] were collected from
online sources like GitHub and Google Drive, where people have shared books for personal
use. The collected books were in epub format, which can be converted to text easily.
After the epubs were collected, we started to create the dataset by converting the epubs
to text format. Each book was then labeled manually with its author name, author gender,
book title, genre, publish date, and book text using our own web application [20],
created for data labeling, which outputs the data as JSON. The genres that we selected
are romance, detective, war, horror, sci-fi, and history. To label the book dataset, we
collected all the information about the books from Rokomari [21] and GoodReads [22]. This
process resulted in a new dataset of labeled Bangla books that can be used to train and
benchmark Bangla book genre classification. This phase took the most time, as the books
and the book information were not widely available.
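For illustration, the epub-to-text step can be sketched as follows. This is a minimal
sketch assuming the ebooklib and BeautifulSoup packages, not the exact tooling used for
this dataset:

```python
# Minimal epub-to-text sketch; the actual conversion tooling may differ.
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def epub_to_text(path):
    """Concatenate the visible text of every document item inside an epub."""
    book = epub.read_epub(path)
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        parts.append(soup.get_text(separator=" ", strip=True))
    return "\n".join(parts)

# Hypothetical usage:
# text = epub_to_text("some_bangla_book.epub")
```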

Figure 3.2: Data Collection Steps

Figure 3.3: First few rows of dataset

3.2 Data Insight

This dataset has a total of six attributes: author name, author gender, book title,
genre, publish date, and book text. The dataset, which consists of 118 books in six
different genres (romance, detective, war, horror, sci-fi, history), will be used in our
experiments with different algorithms. Table (3.1) shows the number of books per genre we
collected, Figure (3.4) shows the percentage of data per genre, and Table (3.2) shows the
author distribution of the dataset.

Table 3.1: Bangla Book Dataset

Genre Books
Romance 30
Detective 28
War 21
Sci-fi 15
Horror 14
History 10
Total 118

Figure 3.4: Percentage of data by genre

Table 3.2: Authorship Distribution

Author                        Gender   Books   Genres                                  Date of Birth
Humayun Ahmed                 male     37      horror, war, romance, history, sci-fi   13-11-1948
Sharadindu Bandyopadhyay      male     11      detective, history                      30-03-1899
Satyajit Ray                  male     10      detective, sci-fi                       02-05-1921
Sarat Chandra Chattopadhyay   male     7       romance                                 15-09-1876
Sunil Gangopadhyay            male     7       detective, sci-fi, history              07-09-1934
Muhammed Zafar Iqbal          male     6       sci-fi, war                             23-12-1952
Buddhadeb Guha                male     4       romance                                 29-06-1936
Arthur Conan Doyle            male     3       sci-fi                                  22-05-1859
H. G. Wells                   male     2       sci-fi                                  21-09-1866
Syed Mustafa Siraj            male     2       detective, horror                       14-10-1930
Taradas Bandyopadhyay         male     2       horror                                  15-10-1947
Selina Hossain                female   2       war                                     14-06-1947
Rabindranath Tagore           male     2       romance                                 07-05-1861
Samaresh Majumdar             male     2       detective                               10-03-1942
Humayun Azad                  male     2       history, war                            28-04-1947
Syed Mujtaba Ali              male     1       romance                                 13-09-1904
Shirshendu Mukhopadhyay       male     1       horror                                  02-11-1935
Suchitra Bhattacharya         female   1       detective                               10-01-1950
Satyen Sen                    male     1       war                                     28-03-1907
Sanjib Chattopadhyay          male     1       detective                               24-10-1936
Anwar Pasha                   male     1       war                                     15-04-1928
Shahaduzzaman                 male     1       war                                     20-01-1960
Shahriar Kabir                male     1       war                                     20-11-1950
Asif Siddique Dipro           male     1       war                                     -
Anisul Hoque                  male     1       war                                     04-03-1965
Rebanta Goswami               male     1       sci-fi                                  31-07-1936
Rakib Hasan                   male     1       detective                               01-01-1950
Nihar Ranjan Gupta            male     1       detective                               06-06-1911
Nimai Bhattacharya            male     1       romance                                 10-04-1931
Tarashankar Bandyopadhyay     male     1       romance                                 23-07-1898
Jahanara Imam                 female   1       war                                     03-05-1929
Umakanta Hajari               male     1       war                                     -
Ahmed Sofa                    male     1       war                                     30-06-1943
Aveek Sarkar                  male     1       horror                                  09-06-1945

3.3 Data Pre-Processing

We collected around 118 books, which inevitably brought storage and memory challenges
during preprocessing, as the text data comes from whole books. To enhance efficiency, we
employed the pandas library to convert the JSON data from the labeling application into a
more manageable CSV format. Given the limitations of a small dataset, we opted to segment
large books into smaller chunks based on their word count [23]. This approach enables
classifiers to concentrate on more manageable and contextually relevant sections, leading
to potential performance enhancements. To determine the optimal chunk size, we conducted
Logistic Regression experiments using various dataset versions, each with a distinct
chunk size; the chunk sizes ranged from 10,000 to 50,000 words. Remarkably, the version
with a chunk size of 20,000 words consistently delivered the most promising results.
Consequently, we concluded that partitioning the dataset into 20,000-word segments was
the most favorable strategy. With this initial conversion complete (see Table 3.2 for the
dataset after chunking), we delved into the core data processing phase. To make the
dataset ready for training and testing, English words, punctuation, special characters,
emojis, extra white space, and digits in both English and Bangla had to be removed. In
our case there is a lot of text data, as we are working on books. We also eliminated
unnecessary data such as stopwords [24], a set of common Bangla words that hold no
significant value for our genre classification task. This removal of stopwords serves a
dual purpose: it reduces noise within the data and diminishes dimensionality, thereby
speeding up processing and ultimately amplifying the model's performance.
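A minimal sketch of this cleaning and 20,000-word chunking, assuming the labeled data has
been loaded into a pandas DataFrame with illustrative "text" and "genre" columns
(stopword removal against a Bangla stopword list [24] would follow the same pattern):

```python
import re
import pandas as pd

def clean_bangla(text):
    """Keep only Bangla letters and whitespace: this drops English words,
    punctuation, special characters, and emojis in one pass. Bangla digits
    (U+09E6-U+09EF) sit inside the Bangla block, so they are removed first."""
    text = re.sub(r"[\u09E6-\u09EF]", " ", text)
    text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk_words(text, chunk_size=20_000):
    """Split a book into consecutive chunks of roughly chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def expand_into_chunks(books_df, chunk_size=20_000):
    rows = [{"genre": row["genre"], "text": chunk}
            for _, row in books_df.iterrows()
            for chunk in chunk_words(clean_bangla(row["text"]), chunk_size)]
    return pd.DataFrame(rows)

# chunks = expand_into_chunks(books_df)   # books_df: one row per labeled book
```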

Figure 3.5: Data Pre-Processing Steps

3.4 Feature Extraction

3.4.1 Term Frequency - Inverse Document Frequency (TF-IDF)

As machine learning only takes numerical data as input, in this feature extraction step
we used the Term Frequency-Inverse Document Frequency (TF-IDF) technique to further
refine and represent the text data from the books in a manner useful for machine learning
tasks. TF-IDF is often chosen as a tool for feature extraction in a variety of Natural
Language Processing (NLP) and text mining tasks [25] [26]. The process includes
tokenizing the text data, creating a document-term matrix, and calculating TF-IDF scores
for each term. This approach captures the importance of terms within individual documents
while considering their rarity across the entire dataset: TF-IDF scores emphasize
significant terms that are common within a document yet rare across the dataset. By
applying TF-IDF, the text data is converted into a structured numerical representation
that retains key information for genre classification. This representation reduces
dimensionality, making it suitable for training machine learning models to accurately
classify book genres based on the learned patterns and relationships within the data.

w_i = \frac{TF_i \times \log(N / n_i)}{\sqrt{\sum_{i=1}^{n} \left( TF_i \times \log(N / n_i) \right)^2}} \quad (3.1)

3.4.2 Bag of Words (BoW)

A Bag of Words (BoW) is a simple and fundamental technique in natural language pro-
cessing (NLP) for text analysis and document classification. It represents a text document
as a collection of individual words or tokens, ignoring grammar and word order. The
resulting "bag" is essentially a frequency distribution of words, where each word in the
document is treated as a separate entity and its occurrence is counted [27].
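Both representations are available in scikit-learn. A minimal sketch, assuming the
chunked DataFrame from the Section 3.3 sketch; the vectorizer settings used in the
experiments are not spelled out in the text, so defaults are shown:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = chunks["text"]            # pre-processed 20,000-word chunks (Section 3.3)

bow = CountVectorizer()           # Bag of Words: raw term counts per chunk
X_bow = bow.fit_transform(texts)

# TF-IDF with L2 normalization (cf. Eq. (3.1); scikit-learn uses a smoothed idf)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)
```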

3.5 Classification

3.5.1 Naïve Bayes Multinomial

Naive Bayes is a classification method based on Bayes's theorem. It assumes that the
predictors are independent; in simpler terms, it treats each predictor as if it
contributed independently to the outcome. This classifier is particularly useful for
large datasets and often performs well. Bayes's theorem is a probability concept that
deals with conditional probabilities. It helps us calculate the probability of one event
happening given that another event has already occurred; this is called a conditional
probability. By using past data, we can use conditional probability to calculate the
likelihood of an event occurring.

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \quad (3.2)

Here,

• P(A) represents the initial probability of hypothesis A being true. This is called the
prior probability.

• P(B) stands for the initial probability of the evidence.

• P(A|B) indicates the probability of the hypothesis being true, given the evidence.

• P(B|A) shows the probability of the evidence being true, given the hypothesis.

In our case, Naive Bayes Multinomial (NBM) is a specialized iteration of the Naive
Bayes algorithm tailored for text classification and tasks within natural language pro-
cessing. Naive Bayes Multinomial (NBM) is reviewed comprehensively by several re-
searchers in terms of text classification tasks [28]. Its efficacy shines when handling
discrete data, particularly word counts in documents. The "multinomial" aspect pertains
to its suitability for scenarios involving multiple distinct classes. NBM presents itself
as a straightforward implementation option, yielding noteworthy performance across a
spectrum of text classification tasks, particularly when working with datasets of moderate
scale.
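A minimal training sketch with scikit-learn's MultinomialNB, assuming the X_tfidf matrix
and chunks DataFrame from the earlier sketches; the 80/20 split ratio and random seed are
assumptions, while alpha = 0.01 follows the smoothing value reported in Section 4.1.3:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, chunks["genre"], test_size=0.2,       # assumed split ratio
    stratify=chunks["genre"], random_state=42)

nb = MultinomialNB(alpha=0.01)   # smoothing value from Section 4.1.3
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))  # overall accuracy on the held-out chunks
```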

3.5.2 Support Vector Machine

The Support Vector Machine algorithm is well-suited for text classification tasks. A
Support Vector Machine (SVM) is a classifier that separates data with a hyperplane. In
two-dimensional space, the hyperplane is a line that divides the plane into two halves,
with each class on its own side.

Figure 3.6: Support Vector Machine (SVM)

In the SVM (Support Vector Machine) algorithm, our objective is to enhance the sep-
aration between the data points and the hyperplane. Several researchers have thoroughly
studied Support Vector Machine (SVM) in terms of text classification tasks [29]. The loss
function responsible for increasing this separation is known as hinge loss.

c(x, y, f(x)) = \begin{cases} 0 & \text{if } y \cdot f(x) \ge 1 \\ 1 - y \cdot f(x) & \text{otherwise} \end{cases} \quad (3.3)

The cost is minimized to zero when the predicted value and the actual value share the
same sign. However, if they have opposite signs, the algorithm calculates the loss by
determining the extent of this mismatch and incorporating a regularization term into the
cost function. This regularization term is introduced to balance the trade-off between
maximizing the separation margin and minimizing the loss. Upon adding this regularization
term, the cost function takes on the following form:

\min_{w} \; \lambda \lVert w \rVert^2 + \sum_{i=1}^{n} \left( 1 - y_i \langle x_i, w \rangle \right)_{+} \quad (3.4)

Taking partial derivatives with respect to the weights determines the directions in which
the weights should be updated. In the event of a misclassification, the update
incorporates the loss along with the regularization term:

w = w + \alpha \cdot (y_i \cdot x_i - 2 \lambda w) \quad (3.5)
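To make Eqs. (3.3)-(3.5) concrete, here is a toy NumPy sketch of the hinge-loss
subgradient update for a binary linear SVM with labels in {-1, +1}; the actual
experiments in Chapter 4 rely on a library SVM implementation, not this hand-rolled loop:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, alpha=0.001, epochs=100):
    """Stochastic subgradient descent on the regularized hinge loss, Eq. (3.4)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(xi, w) >= 1:
                w -= alpha * (2 * lam * w)            # correct side: only the regularizer pulls on w
            else:
                w += alpha * (yi * xi - 2 * lam * w)  # margin violated: Eq. (3.5)
    return w
```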

3.5.3 Logistic Regression

Logistic Regression stands out as a prominent classification algorithm renowned for its
effectiveness in handling categorical data. This algorithm proves particularly valuable
when dealing with scenarios involving binary outcomes. The essence of Logistic Regression
lies in predicting the probability of a specific outcome, denoted as P(Y = 1), based on
the input data X. The resulting prediction resides within the range of 0 to 1 and is
produced by a logistic curve [30]. Logistic Regression finds extensive utility in various
data modeling tasks, such as identifying spam, classifying movie reviews as positive or
negative, and detecting tumor malignancy. It occupies a significant position within a
category of models known as generalized linear models. While Logistic Regression and
linear regression share resemblances, they diverge in their curve construction. In
Logistic Regression, the curve is shaped by the natural logarithm of the odds of the
target variable. Distinct variations of Logistic Regression cater to different scenarios.
In binomial or binary logistic regression, only two possible outcomes exist. Multinomial
logistic regression accommodates situations where three or more non-ordered categories
are feasible. Ordinal logistic regression, on the other hand, finds its niche in handling
ordered dependent variables. For an illustration, Figure (3.7) compares a basic linear
model with a logistic model.
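A short numerical sketch of the logistic curve: the linear score is passed through the
sigmoid, so the prediction always lies in (0, 1) and the log-odds are linear in the
inputs (the example scores are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])   # example linear scores w . x + b
print(sigmoid(z))                # approx. [0.018, 0.5, 0.982], all within (0, 1)
```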

Figure 3.7: Comparison Between Linear model and Logistic model

3.5.4 Multi-layer Perceptron (MLP)

The Multilayer Perceptron (MLP) stands as a widely embraced and prevalent neural network
model in the realm of deep learning [31]. This model comprises three distinct layer
types. Firstly, the input layer takes in all pertinent inputs. The output layer, on the
other hand, constitutes the final stage of the network; in our specific scenario, the
output layer encompasses 6 nodes to accommodate the 6 available classes. Between the
input and output layers lies the hidden layer, a sequential network of perceptrons.
Multiple hidden layers may exist within an MLP, although we have opted for a single
hidden layer of 1000 nodes in our case, with a maximum of 200 iterations. Within the
fully connected network of perceptrons, each node emerges as a linear combination of
weighted contributions from the nodes in the preceding layer, and subsequently passes
through an activation function. Stated differently, the value of each node corresponds to
the sum of the products of its connected values and associated weights. This procedural
sequence computes every layer in turn, culminating in the final output layer and the
desired outcomes. For an illustrative depiction of the MLP architecture, refer to Figure
(3.8).
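A minimal scikit-learn sketch matching the configuration above: one hidden layer of 1000
nodes and at most 200 iterations, with scikit-learn's defaults assumed for everything
else and early stopping added per Section 4.1.4:

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(1000,),  # one hidden layer of 1000 nodes
                    max_iter=200,                # at most 200 iterations
                    early_stopping=True)         # as described in Section 4.1.4
mlp.fit(X_train, y_train)
```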

Figure 3.8: MLP architecture

3.5.5 CNN

A Convolutional Neural Network (CNN) is a specific type of artificial neural architecture
used in supervised learning tasks within machine learning [32]. CNNs employ perceptrons
as building blocks for analyzing data. These networks utilize three-dimensional layers in
which only a subset of neurons maintains connections with the preceding layer. The
structure of CNNs comprises various layers, including convolutional layers (kernels),
pooling layers, rectified linear unit (ReLU) layers, and fully connected layers. The
output of these layers is then fed into the neurons of a neural network. The inspiration
for deep convolutional neural networks comes from their incorporation of local
connections between layers and their capacity to achieve spatial invariance. A common
practice in CNN architecture involves interleaving pooling layers with successive
convolutional layers (kernels) to mitigate overfitting by reducing the number of
parameters and the computational demands. CNNs excel at extracting pertinent information
from input data, and by reshaping the input data they prepare it for the further
application of methods like Artificial Neural Networks.
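As the text does not spell out the exact CNN architecture used, the following Keras
sketch is purely illustrative: the layer sizes are placeholders, and the sparse BoW or
TF-IDF matrices must be converted to dense arrays (and the genre labels to integer ids)
before fitting:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_features = X_train.shape[1]                  # BoW / TF-IDF vector length
model = models.Sequential([
    layers.Reshape((num_features, 1), input_shape=(num_features,)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),   # convolution + ReLU
    layers.MaxPooling1D(pool_size=4),            # pooling interleaved with conv
    layers.Flatten(),
    layers.Dense(128, activation="relu"),        # fully connected layer
    layers.Dense(6, activation="softmax"),       # six genre classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_categorical_accuracy"])
```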

3.6 Evaluation Metrics

3.6.1 Accuracy

Classification accuracy in machine learning is a fundamental performance metric that
measures the correctness of a model's predictions. It quantifies the proportion of
correctly classified instances out of the total instances in a dataset. Expressed as a
percentage, accuracy is a straightforward way to assess how well a model performs on a
given task.

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}} \quad (3.6)

However, it has limitations, particularly when dealing with imbalanced datasets where
one class is significantly more prevalent than others. In such cases, a high accuracy might
not reflect the true effectiveness of a model, as it could be heavily biased toward the
dominant class. Thus, while accuracy provides a quick overview of a model’s overall
performance, it should be interpreted alongside other metrics, especially when facing
complex or imbalanced data distributions.

3.6.2 Confusion Matrix

The confusion matrix serves as a tool to assess the performance of a classification
algorithm. It becomes particularly valuable in scenarios where there is an unequal
distribution of samples across categories or when dealing with datasets containing more
than two groups. Relying solely on classification accuracy can be misleading under these
circumstances. Hence, the confusion matrix is utilized to provide a clearer
representation of outcomes. Within the confusion matrix, the counts of predicted and
actual cases for both positive and negative outcomes are presented. This framework offers
insights into the model's behavior and its effectiveness in differentiating between
classes.

Figure 3.9: Confusion Matrix

There are 4 important terms :

• True Positives (TP) : The cases in which we predicted YES and the actual output
was also YES.

• True Negatives (TN) : The cases in which we predicted NO and the actual output
was NO.

• False Positives (FP) : The cases in which we predicted YES and the actual output
was NO.

• False Negatives (FN) : The cases in which we predicted NO and the actual output
was YES.

3.6.3 Precision

Precision in machine learning is a measure that focuses on how accurate a model is when
it predicts a positive outcome. It looks at the proportion of correctly predicted
positive cases out of all the cases the model predicted as positive. In simple terms,
precision helps us understand how good the model is at avoiding false positives:
situations where it wrongly predicts something as positive when it is actually not. A
high precision indicates that when the model says something is positive, it is usually
right.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \quad (3.7)

3.6.4 Recall

Recall in machine learning is a measure that focuses on capturing all the relevant
positive cases. It calculates the proportion of correctly predicted positive cases out of
all the actual positive cases in the dataset. In simpler terms, recall helps us
understand how good the model is at finding all the important things it should find.

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \quad (3.8)

3.6.5 F1-Score

The F1 score in machine learning combines the precision and recall metrics to provide a
fair assessment of a model's performance. This is especially helpful when trying to
strike a balance between reducing false positives and false negatives. The F1 score is
constructed as the harmonic mean of recall and precision, which gives greater weight to
the lower of the two values. In other words, the F1 score penalizes models that
significantly favor one metric over the other and rewards those that have both high
precision and high recall. It is like finding the ideal balance between being cautious
not to make too many errors and still ensuring that crucial items are not missed. When
assessing a model's overall performance, particularly in circumstances where precision
and recall must be balanced, such as text classification or spam detection, the F1 score
is useful.

\text{F1-Score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \quad (3.9)
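All four metrics, together with the confusion matrix, can be produced in one pass with
scikit-learn. A minimal sketch, assuming a fitted classifier such as the nb model from
the Section 3.5.1 sketch:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

y_pred = nb.predict(X_test)
print(accuracy_score(y_test, y_pred))          # Eq. (3.6)
print(confusion_matrix(y_test, y_pred))        # the counts behind TP/TN/FP/FN
print(classification_report(y_test, y_pred))   # per-class precision, recall, F1
```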

4 Experimental Evaluation

In this section, we present a comprehensive overview of the output and behavioral pro-
cesses utilized in the Bangla Book Genre Classification across all models. The evaluation
approach encompasses several crucial steps, including the creation of confusion matrices
and the generation of Classification Reports. Together, these steps provide insights into
the strengths and weaknesses of the models. The confusion matrices visually depict the
models’ predictions compared to the actual genres, facilitating a clear understanding of
their areas of proficiency and limitations. These matrices are a valuable tool for identify-
ing misclassification patterns and highlighting genres that may challenge the models. In
addition to the confusion matrices, the Classification Reports offer a more detailed analy-
sis of the models’ performance. They include metrics like precision, recall, F1-score, and
support for each genre category, providing a comprehensive breakdown of their capabili-
ties.

4.1 Findings

4.1.1 Logistic Regression

Table (4.1) and Table (4.2) present the classification report for our Logistic Regression
analysis. Our experiments yielded an accuracy of 85% for this model with Bag of Words
(BoW) and 89% with TF-IDF, a notable result; TF-IDF gives a relative improvement of 4.7%
over BoW.

Table 4.1: Logistic Regression Classification Report (BoW)

Class Precision Recall F1 Score


Detective 0.90 0.95 0.92
History 0.95 0.90 0.92
Horror 1.00 0.44 0.62
Romance 0.68 1.00 0.81
Sci-fi 0.86 0.75 0.80
War 0.89 0.73 0.80
Accuracy 0.85
Weighted avg 0.87 0.85 0.84

Table 4.2: Logistic Regression Classification Report (TF-IDF)

Class Precision Recall F1 Score


Detective 0.90 0.95 0.92
History 0.87 1.00 0.93
Horror 1.00 0.38 0.55
Romance 0.81 1.00 0.89
Sci-fi 1.00 0.67 0.80
War 1.00 1.00 1.00
Accuracy 0.89
Weighted avg 0.91 0.89 0.88

Figure 4.1: Confusion Matrix of Logistic Regression - BoW (left) and TF-
IDF(right)

4.1.2 Support Vector Machine

Table (4.3) and Table (4.4) present the classification report for our Support Vector
Machine. Our experiments yielded an accuracy of 83% for this model with BoW and 94% with
TF-IDF, a notable result; TF-IDF gives a relative improvement of 13.25% over BoW in this
model. The optimal set of parameters that we used is 'C': 10, 'degree': 2, 'gamma': 0.1,
'kernel': 'linear'.
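The quoted parameter set reads like the output of a grid search; the actual grid is not
reported, so the following sketch only illustrates how such a search could be run (the
grid values are hypothetical):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "degree": [2, 3],       # hypothetical grid
              "gamma": [0.1, 1], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # e.g. {'C': 10, 'degree': 2, 'gamma': 0.1, 'kernel': 'linear'}
```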

Table 4.3: Support Vector Machine Classification Report (BoW)

Class Precision Recall F1 Score


Detective 0.90 0.95 0.92
History 0.95 0.90 0.92
Horror 1.00 0.44 0.62
Romance 0.63 1.00 0.77
Sci-fi 1.00 0.62 0.77
war 0.89 0.73 0.80
Accuracy 0.83
Weighted avg 0.88 0.83 0.83

Table 4.4: Support Vector Machine Classification Report (TF-IDF)

Class Precision Recall F1 Score


Detective 1.00 1.00 1.00
History 1.00 1.00 1.00
Horror 1.00 0.50 0.67
Romance 0.85 1.00 0.92
Sci-fi 1.00 0.89 0.94
War 0.85 1.00 0.92
Accuracy 0.94
Weighted avg 0.95 0.94 0.93

Figure 4.2: Confusion Matrix of Support Vector Machine - BoW (left) and TF-
IDF(right)

4.1.3 Naïve Bayes Multinomial

In our classifier analysis, a noteworthy enhancement in performance was observed by
adjusting the alpha value. The alpha parameter controls smoothing in the scikit-learn
API's Multinomial Naïve Bayes classifier. By default alpha is set to 1.0; however, we
experimented with a value of 0.01, leading to a notable improvement in the classification
outcomes. The Naïve Bayes Multinomial confusion matrix is presented in Figure (4.3), and
Table (4.5) and Table (4.6) give the Naïve Bayes Multinomial classification report. Our
experiments yielded an accuracy of 85% for this model with BoW and 94% with TF-IDF, a
notable result; TF-IDF gives a relative improvement of 10.59% over BoW in this model.

Table 4.5: Naïve Bayes Multinomial Classification Report (BoW)

Class Precision Recall F1 Score


Detective 0.94 0.79 0.86
History 0.86 0.95 0.90
Horror 1.00 0.56 0.71
Romance 0.82 0.82 0.82
Sci-fi 0.89 1.00 0.94
War 0.67 0.91 0.77
Accuracy 0.85
Weighted avg 0.86 0.85 0.84

Table 4.6: Naïve Bayes Multinomial Classification Report (TF-IDF)

Class Precision Recall F1 Score


Detective 1.00 0.95 0.97
History 0.95 1.00 0.98
Horror 1.00 0.62 0.77
Romance 0.89 1.00 0.94
Sci-fi 1.00 0.89 0.94
War 0.85 1.00 0.92
Accuracy 0.94
Weighted avg 0.95 0.94 0.94

Figure 4.3: Confusion Matrix of Naïve Bayes Multinomial - BoW (left) and
TF-IDF (right)

4.1.4 MLP

Here, we used a hidden layer of 1000 nodes and early stopping so the model does not
overfit. Table (4.7) and Table (4.8) present the classification report for our MLP. Our
experiments yielded an accuracy of 88% for this model with BoW and 93% with TF-IDF, a
notable result; TF-IDF gives a relative improvement of 5.68% over BoW in this model.

Table 4.7: MLP Classification Report (BoW)

Class Precision Recall F1 Score


Detective 1.00 0.79 0.88
History 0.95 0.95 0.95
Horror 1.00 0.56 0.71
Romance 0.81 1.00 0.89
Sci-fi 0.80 1.00 0.89
War 0.77 0.91 0.83
Accuracy 0.88
Weighted avg 0.90 0.88 0.88

Table 4.8: MLP Classification Report (TF-IDF)

Class Precision Recall F1 Score


Detective 0.95 0.95 0.95
History 0.95 1.00 0.98
Horror 0.75 0.67 0.71
Romance 0.89 1.00 0.94
Sci-fi 1.00 1.00 1.00
War 1.00 0.82 0.90
Accuracy 0.93
Weighted avg 0.93 0.93 0.93

Figure 4.4: Confusion Matrix of MLP - BoW (left) and TF-IDF(right)

4.1.5 CNN

Early stopping, implemented through the callbacks API, was employed in our study to
enhance the training process of our Convolutional Neural Network (CNN) model. Our chosen
metric for monitoring was sparse categorical accuracy. Specifically, we halted the
training procedure if no discernible improvement in validation accuracy was observed for
a consecutive span of five epochs. This strategy was implemented to prevent overfitting
and optimize the model's generalization capabilities. As depicted in Figure (4.5), the
graphs illustrate the trend in accuracy (left) and the corresponding loss values (right)
throughout the training process using BoW; Figure (4.6) shows the same using TF-IDF. It
is important to note that the weights utilized for the model were those retrieved from
the epoch that exhibited the highest validation accuracy. This ensures that the model is
equipped with the most optimal parameter values, thereby potentially improving its
performance on unseen data. This technique aligns with our overarching goal of training
an efficient and effective CNN model for the task at hand.
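A minimal sketch of this early-stopping setup with the Keras callbacks API. The monitored
metric, the patience of five epochs, and the best-weight restoration follow the
description above; the epoch budget, the validation split, and the names X_train_dense
and y_train_ids (the dense feature matrix and integer-encoded labels) are assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_sparse_categorical_accuracy",
                           patience=5,                  # five epochs without improvement
                           mode="max",
                           restore_best_weights=True)   # keep the best-epoch weights
history = model.fit(X_train_dense, y_train_ids,
                    validation_split=0.2, epochs=50,
                    callbacks=[early_stop])
```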

Figure 4.5: CNN Model Accuracy (left) and Loss (right) using BoW

Figure 4.6: CNN Model Accuracy (left) and Loss (right) using TF-IDF

Table (4.9) and Table (4.10) display the CNN classification reports, offering an overview
of the model's classification performance. Additionally, Figure (4.7) presents the CNN
confusion matrices, visually summarizing the alignment between predicted and actual class
labels for both text representation techniques: BoW on the left and TF-IDF on the right.

Table 4.9: CNN Classification Report (BoW)

Class Precision Recall F1 Score


Detective 0.88 0.79 0.83
History 0.90 0.95 0.93
Horror 0.62 0.56 0.59
Romance 0.68 0.88 0.77
Sci-fi 0.83 0.62 0.71
War 0.90 0.82 0.86
Accuracy 0.81
Weighted avg 0.82 0.81 0.81

Table 4.10: CNN Classification Report (TF-IDF)

Class Precision Recall F1 Score


Detective 1.00 0.95 0.97
History 0.87 1.00 0.93
Horror 1.00 0.67 0.80
Romance 0.94 1.00 0.97
Sci-fi 1.00 0.88 0.93
War 0.83 0.91 0.87
Accuracy 0.93
Weighted avg 0.94 0.93 0.93

Figure 4.7: Confusion Matrix of CNN - BoW (left) and TF-IDF(right)

4.1.6 Result Summary

In summary, SVM and Naive Bayes Multinomial with TF-IDF feature extraction gave the best
result among all the algorithms, with 0.94 accuracy, while CNN with BoW feature
extraction gave the lowest accuracy, 0.81. The performance comparison between our
classifiers is shown below: Figure (4.8) and Figure (4.9) show the performance comparison
of the different machine learning and neural network techniques, and Table (4.11) gives a
summary of all the models we ran with the BoW and TF-IDF feature extraction techniques.

Figure 4.8: Model Evaluation of Different Models (BoW)

Figure 4.9: Model Evaluation of Different Models (TF-IDF)

Table 4.11: Performance comparison

Classifier                Feature Extraction   Accuracy   Precision   Recall   F1-score
SVM                       BoW                  0.83       0.88        0.83     0.83
SVM                       TF-IDF               0.94       0.95        0.90     0.91
Naive Bayes Multinomial   BoW                  0.85       0.86        0.85     0.84
Naive Bayes Multinomial   TF-IDF               0.94       0.95        0.94     0.94
CNN                       BoW                  0.81       0.82        0.81     0.81
CNN                       TF-IDF               0.93       0.93        0.94     0.93
MLP                       BoW                  0.88       0.90        0.88     0.88
MLP                       TF-IDF               0.92       0.92        0.92     0.91
Logistic Regression       BoW                  0.85       0.87        0.85     0.84
Logistic Regression       TF-IDF               0.89       0.91        0.89     0.88

4.2 Comparing with Similar Papers

As we could not find any dataset on Bangla book genre classification, we decided to
compare our results with similar text classification works. Table (4.12) shows the
comparison of our work with some similar previous works. The comparison shows that our
dataset is considerably smaller than those of other research; our results are
nevertheless promising.

Table 4.12: Comparing Our Work with Previous Work (Accuracy)

Paper      Classes   Samples   Performance
[12]       5         169,791   95.50%
[15]       10        3,000     93.43%
[14]       12        75,951    87.00%
[13]       12        95,853    84.94%
Our work   6         118       94.00%

5 Conclusion

This chapter offers the thesis's final remarks, a list of its shortcomings, and
information on where it will go from here. The thesis's conclusion is stated in section
5.1, its limitations are explored in section 5.2, and its future path is defined in
section 5.3.

5.1 Conclusion

In this thesis, we give a thorough analysis of the categorization of Bangla book genres
using a variety of machine learning and neural network based techniques. We used the
machine learning based Logistic Regression, Support Vector Machine (SVM), and Naïve Bayes
Multinomial, as well as the neural network based Multilayer Perceptron (MLP) and
Convolutional Neural Network (CNN). SVM and Naive Bayes Multinomial gave the best result
among all the algorithms, with 94% accuracy. All of these models can categorize the six
genre types. We built an extensive system that can categorize the textual input data of
Bangla books and predict their genre.

5.2 Limitations

The main limitation of this work is the amount of data. Because of the small dataset, the
horror genre cannot be predicted accurately: horror books get incorrectly classified,
since books of some other genres can have similar characteristics. More horror books
could solve this problem. Also, since a book can have characteristics of multiple genres,
implementing multi-label categorization could address this limitation.

5.3 Future Work

The performance and application of the system may yet be improved, despite the fact that
our study has made considerable strides in Bangla book genre classification. Future study
might focus on a number of issues, including:

• Adding more books to our dataset. An expanded dataset would provide a more comprehensive and robust foundation for analysis and for training the machine learning models.

• Adding more variation to our data.

• Adding Bangla stemming and lemmatization. Because of the shortage of good-quality Bangla stemming and lemmatization libraries, we did not use one in this work.

• Using feature selection methods that have been applied in other languages but that we did not try here (a chi-squared sketch follows this list).

• Implementing multi-label categorization to better classify books that fall under many genres yet have comparable contents.
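As one example of the feature selection point above, the sketch below applies chi-squared scoring, a method widely used for text classification in other languages, on top of TF-IDF features. The tiny corpus is a hypothetical placeholder, and k would be tuned on the real dataset.

# Hedged sketch of chi-squared feature selection over TF-IDF features.
# The corpus and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = [
    "ghost night fear scream shadow",
    "love letter heart kiss garden",
    "battle soldier rifle march trench",
]
labels = ["Horror", "Romance", "War"]

X = TfidfVectorizer().fit_transform(texts)
# Keep only the k terms most associated with the genre labels.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(X.shape, "->", X_selected.shape)  # (3, 15) -> (3, 5)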

References

[1] Forecast number of mobile devices worldwide from 2020 to 2025. [Online]. Available: https://www.statista.com/statistics/245501/multiple-mobile-device-ownership-worldwide/

[2] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, vol. 2, no. Nov, pp. 45–66, 2001.

[3] D. D. Lewis and M. Ringuette, "A comparison of two learning algorithms for text categorization," in Third Annual Symposium on Document Analysis and Information Retrieval, vol. 33, 1994, pp. 81–93.

[4] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.

[5] M. S. R. R. Karim and M. Z. Iqbal, "Recognition of spoken letters in Bangla," in 5th International Conference on Computer and Information Technology (ICCIT 02), 2002.

[6] Ethnologue: Languages of the world. [Online]. Available: https://www.ethnologue.com/

[7] A. I. Kadhim, "Survey on supervised machine learning techniques for automatic text classification," Artificial Intelligence Review, vol. 52, no. 1, pp. 273–292, 2019.

[8] G. Kou, P. Yang, Y. Peng, F. Xiao, Y. Chen, and F. E. Alsaadi, "Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods," Applied Soft Computing, vol. 86, p. 105836, 2020.

[9] B. Y. Panchal, "Book genre categorization using machine learning algorithms (k-nearest neighbor, support vector machine and logistic regression) using customized dataset," March 2021. [Online]. Available: https://ssrn.com/abstract=3805945

[10] C. U, G. JJ, R. Calvo, and C. H.A, "Automatic classification of news articles in Spanish," 01 2004.

[11] I. Rasheed, V. Gupta, H. Banka, and C. Kumar, "Urdu text classification: A comparative study using machine learning techniques," in 2018 Thirteenth International Conference on Digital Information Management (ICDIM), 2018, pp. 274–278.

[12] S. Yeasmin, R. Kuri, A. R. M. M. H. Rana, A. Uddin, A. Q. M. S. U. Pathan, and H. Riaz, "Multi-category Bangla news classification using machine learning classifiers and multi-layer dense neural network," International Journal of Advanced Computer Science and Applications, vol. 12, no. 5, 2021. [Online]. Available: http://dx.doi.org/10.14569/IJACSA.2021.0120588

[13] S. Alam, M. A. U. Haque, and A. Rahman, "Bengali text categorization based on deep hybrid CNN-LSTM network with word embedding," in 2022 International Conference on Innovations in Science, Engineering and Technology (ICISET), 2022, pp. 577–582.

[14] K. Salehin, M. K. Alam, M. A. Nabi, F. Ahmed, and F. B. Ashraf, "A comparative study of different text classification approaches for Bangla news classification," in 2021 24th International Conference on Computer and Information Technology (ICCIT), 2021, pp. 1–6.

[15] M. Hossain, S. Sarkar, and M. Rahman, "Different machine learning based approaches of baseline and deep learning models for Bengali news categorization," International Journal of Computer Applications, vol. 176, pp. 10–16, 04 2020.

[16] eedeidk. bongboi. [Online]. Available: https://github.com/eedeidk/bongboi

[17] eboipotro. [Online]. Available: https://github.com/eboipotro

[18] Ala's kindle. [Online]. Available: https://drive.google.com/drive/u/0/folders/1CnhV0AqbvCsuAMpNsCQem4JR7Z9gXAyw

[19] Pinu's kindle. [Online]. Available: https://drive.google.com/drive/u/0/folders/1FAukxq7IzhUgqKA9VRxShObgN ipV 2G

[20] A. S. Arnob. Book data collection web app. [Online]. Available: https://github.com/ShafiArnob/book-data-collection-react

[21] (2023) Rokomari book store. Accessed on June 01, 2023. [Online]. Available: https://www.rokomari.com/book

[22] (2023) Goodreads. Accessed on June 01, 2023. [Online]. Available: https://www.goodreads.com/?ref=navh om

[23] S. Todeschini. How to chunk text data: A comparative analysis. [Online]. Available: https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a

[24] N. Afnan. Bangla stopwords. [Online]. Available: https://www.kaggle.com/datasets/nuhashafnan/bangla-stopwords

[25] G. Forman and I. Cohen, "Learning from little: Comparison of classifiers given little training," in Knowledge Discovery in Databases: PKDD 2004, J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 161–172.

[26] K. Masuda, T. Matsuzaki, and J. Tsujii, "Semantic search based on the online integration of NLP techniques," Procedia - Social and Behavioral Sciences, vol. 27, pp. 281–290, 2011, Computational Linguistics and Related Fields. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877042811024360

[27] S. Yeasmin, R. Kuri, A. R. M. M. H. Rana, A. Uddin, A. Q. M. S. U. Pathan, and H. Riaz, "Multi-category Bangla news classification using machine learning classifiers and multi-layer dense neural network," International Journal of Advanced Computer Science and Applications, vol. 12, no. 5, 2021. [Online]. Available: http://dx.doi.org/10.14569/IJACSA.2021.0120588

[28] P. Bolaj and S. Govilkar, "Text classification for Marathi documents using supervised learning methods," Int. J. Comput. Appl., vol. 155, no. 8, pp. 6–10, 2016.

[29] Z.-q. Wang, X. Sun, D.-x. Zhang, and X. Li, "An optimal SVM-based text classification algorithm," in 2006 International Conference on Machine Learning and Cybernetics, 2006, pp. 1378–1381.

[30] M. A. Ul Haque, A. Rahman, and M. M. A. Hashem, "Sentiment analysis in low-resource Bangla text using active learning," in 2021 5th International Conference on Electrical Information and Communication Technology (EICT), 2021, pp. 1–6.

[31] S. K. Srivastava, S. K. Singh, and J. S. Suri, "Chapter 16 - A healthcare text classification system and its performance evaluation: a source of better intelligence by characterizing healthcare text," in Cognitive Informatics, Computer Modelling, and Cognitive Science, G. Sinha and J. S. Suri, Eds. Academic Press, 2020, pp. 319–369. [Online]. Available: https://www.sciencedirect.com/science/article/pii/B9780128194454000163

[32] Z. Wang and Z. Qu, "Research on web text classification algorithm based on improved CNN and SVM," in 2017 IEEE 17th International Conference on Communication Technology (ICCT), 2017, pp. 1958–1961.
