
Book Genre Prediction

2021, IJRASET


International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.429
Volume 9 Issue X Oct 2021 - Available at www.ijraset.com
https://doi.org/10.22214/ijraset.2021.38409

Priyal Desai1, Ghata Saraiya2, Maliha Nan3
1, 2, 3 Computer Engineering Dept., Dharmsinh Desai University, Nadiad, Gujarat, India

Abstract: The present work aims to classify the genre of books automatically using the Python programming language. A genre is a subset of art, literature, or music that has a distinct form, substance, and style. In many instances, a book can be classified as belonging to more than one genre, and it is difficult to categorize a book or piece of literature as belonging to one genre over another. Many novels end up badly categorized or pushed under the super-genre umbrella of fiction, since there is no clear criterion to determine how much of a book belongs to a given genre. It is therefore critical to develop a system for categorizing books and determining their relevance to a particular genre. The current study addresses this challenge by combining various text categorization approaches and models to arrive at the best solution.

I. INTRODUCTION

Big Data technologies are intimately linked to the science of artificial intelligence, and natural language analysis is one of its specialties. Computers can be taught to recognize particular patterns in processed texts and to classify words, phrases, or even entire documents into predefined groups based on these patterns. Such a project can be set up with open-source tools that classify text after a preceding automatic learning phase on preset input data. Many machine learning algorithms for natural language processing (NLP) rely on a statistical model, in which decisions are made using a probabilistic approach.
Deep learning algorithms have also been used recently, with excellent results.

The input data consists of text fragments, which can be basic word sequences, whole sentences, or even entire documents. The text is transformed, and different features of the data are given varying weights. Machine learning models are trained on the input data and can then be applied to new, unseen data. Compared with rule-based linguistic models, such algorithms learn from data and are better at interpreting new or erroneous input, such as spelling mistakes or missing words. Linguistic models are built on a set of established grammatical rules, which are error-prone when dealing with unfamiliar or incorrect input and are harder to maintain in large and complicated systems. The accuracy of machine learning models generally improves with the amount of input data: providing additional texts from which the model can learn tends to improve the predictions on newly processed data.

Natural Language Processing encompasses a wide range of tasks, including part-of-speech tagging, named entity recognition (which aims to locate and identify named entities such as people, places, and organizations), machine translation, speech recognition, question answering, and sentiment analysis.

The present work focuses on using the Python programming language to predict the genre of a book. The study is based on classifying books by their summaries. The working hypothesis is that novels can be classified based on the word content of their written summaries. Once it has been trained on a dataset, the model is used to categorize new books into predefined categories. One of the long-term goals is to make book classification easier and to inform consumers about possible genres and genre overlap, which may make it easier to categorize novels as belonging to multiple genres.

The classification of literary works differs greatly from the classification of ordinary texts.
One of the major reasons is the length of books, which is far greater than that of most other text media. As a result, we work with book summaries rather than the complete text.

II. LITERATURE REVIEW

Topic segmentation and recognition are two capabilities of Natural Language Processing. They require a set of input data as well as machine learning models capable of categorizing the text into various topics. Unsupervised and supervised learning are two approaches to the problem. Unsupervised algorithms use input data that has not been hand-annotated with the correct class or topic, whereas supervised algorithms use data that has been labeled with the relevant class or topic. Unsupervised learning is generally more difficult and produces less accurate results than supervised learning. Nonetheless, the volume of unlabeled data is much larger than the volume of data that has been assigned to the correct classes, and in some cases an unsupervised approach is the only alternative.

© IJRASET: All Rights are Reserved

Unsupervised NLP methods use machine learning clustering algorithms to divide text data into segments and identify the class of each group. Supervised algorithms, on the other hand, require a set of textual data with the appropriate labels already filled in. What is the best way to obtain labeled text? One option is to use data that has already been assigned to a category, such as a movie genre, product categories, document categories, or comment subjects. A supervised NLP classification algorithm examines the input data and should be able to determine which topic or class a new text belongs to, based on the classes found in the training data.
To some extent, the level of confidence depends on the amount of available training data, as well as on the similarity between the new incoming text and the texts from which the model has learned. Most algorithms also report a confidence score for a match, usually between 0 and 1. The user can choose a confidence threshold that determines when a result is dismissed. The outcome can take the form of one or more classes determined by the model. The end-to-end flow of a supervised NLP algorithm includes acquiring labeled data, preparing the data, building a classification model, and using the model to predict the topic of a new text.

III. METHODOLOGY

Finding and evaluating an appropriate dataset is a crucial step. Two datasets were identified for the current study; they are discussed in the following subsections.

A. CMU Book Summary

The first dataset is “CMU Book Summary”, which includes plot summaries for 16,559 novels collected from Wikipedia, as well as aligned metadata from Freebase, such as author, title, and genre. The dataset contains 179 distinct genres with a highly skewed distribution. To balance the distribution, we keep only mainstream genres and delete books from genres such as "anti-war." The number of books per genre nevertheless remains unbalanced. To make the distribution more even, data augmentation is applied in a later step. The genres kept and their distribution are shown in Figure 1.

Figure 1: Distribution of the CMU Book Summary Dataset

1) Data Preparation: A list of documents made up of word sequences or entire phrases is the starting point for document classification.
In their original form, they cannot be used as machine learning features. The documents must be split and turned into useful features for a machine learning model. For this purpose, the Natural Language Toolkit (NLTK) was used to clean the summaries. The following pre-processing steps were applied:
a) Conversion to lowercase
b) Eliminating the punctuation marks
c) Eliminating stop words
d) Word-to-number conversion
e) Stemming

2) Data Augmentation: In computer vision, data augmentation is often used: an image can usually be flipped, rotated, or mirrored without affecting its original label. Text, and summaries in particular, requires more care. For data augmentation, a single basic operation that does not change the book's category was applied, namely synonym replacement: n words that are not stop words are picked at random from the sentence, and each of them is replaced with a synonym picked at random. Following the distribution, more records are generated for genres with fewer records. Figure 2 depicts the distribution of the CMU dataset after data augmentation; it is more balanced than the previous distribution.

Figure 2: Distribution of the CMU dataset after data augmentation

The next step was to develop a classifier and fit it to the dataset. To do so, a pipeline was used that vectorizes the data before it is fed to the classifier.

B. The Blurb Genre Collection

The second dataset is “The Blurb Genre Collection”, which contains blurbs (advertising descriptions of books) and aligned metadata for 91,982 books collected from Penguin Random House.
The data distribution of The Blurb Genre Collection is shown in Figure 3. Furthermore, multiple genres are assigned to each summary, as shown in Figure 4.

Figure 3: Distribution of the Blurb Genre Collection

Figure 4: Genres as per summary

C. Data Preparation

As the data is in XML format, the first step was to extract the information from the file and turn it into a data frame. Every blurb is assigned to one or more categories, and the categories are arranged in a hierarchy. Under the minimum coding policy, each document in the collection must be assigned at least one category. Under the hierarchy policy, every ancestor of a document's label is assigned as well. To make data access easier, we prepared a list of the genres to which each blurb belongs.

D. Multilabel Classification

Multi-label classification is a generalization of multi-class classification, which is the single-label problem of categorizing instances into precisely one of more than two classes. In the multi-label problem, there is no limit to how many classes an instance can be assigned to, so the output data used for training can have one, two, or many labels.

The F1 metric was used for evaluation. The score is the harmonic mean of precision and recall (F1 Score = 2 * (precision * recall) / (precision + recall)). The F1 score is micro-averaged so that it can serve as a multi-class classification metric: it is calculated globally by counting the total true positives, false positives, and false negatives. For multilabel targets, the labels are column indices, and by default all of them are used in sorted order.
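
The micro-averaged F1 score described above can be illustrated with a minimal pure-Python sketch. It is not the code used in the study, which relies on a library metric; the function name `micro_f1` and the three sample blurbs are hypothetical and serve only to show how the counts are pooled over all labels before precision and recall are computed.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multilabel data.

    y_true and y_pred are lists of label sets, one set per sample.
    Counts are pooled over all samples and labels before averaging.
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)  # predicted and correct
        fp += len(pred_labels - true_labels)  # predicted but not true
        fn += len(true_labels - pred_labels)  # true but not predicted
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical genres for three blurbs (illustrative data only).
y_true = [{"fiction", "mystery"}, {"romance"}, {"fiction", "fantasy"}]
y_pred = [{"fiction"}, {"romance", "fiction"}, {"fiction", "fantasy"}]

score = micro_f1(y_true, y_pred)
print(round(score, 3))
```

In this toy example there are 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.8 and the micro-averaged F1 is 0.8.
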
To adapt the dataset to the multilabel setting, it was first converted into a binary matrix with a separate column for each genre. There are 139 different genres. If a blurb belongs to a genre, it has a value of 1 in that column; otherwise, it has a value of 0. Another method uses scikit-learn's MultiLabelBinarizer, which converts the intuitive format into the supported multilabel format: a (samples x classes) binary matrix indicating the presence of a class label.

The next step was to vectorize the summaries with the TF-IDF vectorizer from the scikit-learn library. After vectorizing, a One-vs-Rest classifier was used for multi-label prediction. This technique, also known as one-vs-all, consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to being computationally efficient (only n_classes classifiers are needed), this technique has the advantage of being interpretable: because each class is represented by only one classifier, inspecting that classifier can provide information about the class. This is the most commonly used multiclass classification strategy and a good default choice. The approach can also be used for multilabel learning, by fitting the classifier on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.

OneVsRestClassifier requires an estimator as a parameter, that is, an object implementing fit and one of decision_function or predict_proba. In the present study, two estimators can be used: Linear Support Vector Machine and Logistic Regression. The classifier was then trained with the fit function.

IV. RESULT

A. CMU Book Summary

To find the accuracy of the developed system for book genre prediction on the CMU Book Summary dataset, the predict function was applied to predict the multi-class targets using the underlying estimators. The results are presented in Figure 5. The SGD (Stochastic Gradient Descent) classifier achieves an accuracy of 0.81 for single-label genre prediction. Except for genres with fewer records, precision and recall are balanced across genres. For books, however, a single genre prediction is typically insufficient; multiple genre predictions are needed for the model to be useful.

Figure 5: Output

B. The Blurb Genre Collection

For the Blurb Genre Collection, the predict_proba function was used to estimate the probabilities. With this model, the F1 score achieved is 0.713 (output: 0.7136495754642689). The estimates for all classes are ordered by class label. Note that in the multilabel case each sample can have any number of labels, so the function returns the marginal probability that the given sample has the label in question. It is entirely consistent, for example, for two different labels to each have a 90% probability of applying to the same sample.

A threshold value was then used to decide whether a book belongs to a given genre. Trying various thresholds on the same predictions allows us to find the best value, which in our case is 0.25. When the threshold is too low, too many tags are chosen, which lowers the F1 score; when the threshold is very high, almost no tags are picked, which also lowers the score.
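
The threshold search described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the per-label probabilities are made-up stand-ins for the output of predict_proba, the helper names `apply_threshold`, `micro_f1`, and `best_threshold` are hypothetical, and the numbers do not reproduce the paper's results.

```python
def apply_threshold(probabilities, threshold):
    """Keep, for each sample, the labels whose probability reaches the threshold."""
    return [
        {label for label, p in sample.items() if p >= threshold}
        for sample in probabilities
    ]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over lists of label sets."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probabilities, candidates):
    """Return the candidate threshold with the highest micro F1, and that score."""
    scored = [
        (micro_f1(y_true, apply_threshold(probabilities, t)), t)
        for t in candidates
    ]
    best_score, best_t = max(scored)
    return best_t, best_score


# Made-up marginal probabilities for three blurbs (illustrative only).
y_true = [{"fiction", "mystery"}, {"romance"}, {"fiction"}]
probabilities = [
    {"fiction": 0.80, "mystery": 0.30, "romance": 0.10},
    {"fiction": 0.20, "mystery": 0.05, "romance": 0.60},
    {"fiction": 0.55, "mystery": 0.28, "romance": 0.12},
]

best_t, best_score = best_threshold(y_true, probabilities, [0.1, 0.25, 0.5, 0.9])
print(best_t, round(best_score, 3))
```

On this toy data the lowest threshold admits too many labels and the highest admits none, so an intermediate value wins, which mirrors the trade-off described in the text. In practice the candidate thresholds would be evaluated on held-out data rather than on the training set.
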
In the next step, inverse_transform was used to obtain the string values of the classes from the binarizer. Figure 6 shows the actual and predicted labels of the books. The accuracy of the predictions was also obtained and is shown in Figure 7.

Figure 6: actual label and predicted labels

Figure 7: Accuracy of prediction

C. Multi-feature Classification

In the models above, only the book's summary is used to predict its genre. However, other attributes, such as the book's title and author, may also be indicative of the genre. This section therefore describes the use of multiple features to classify the data. The first step was to convert each row into a dictionary. Various classifiers were then applied, and the best accuracy was obtained with the Naive Bayes classifier. Naive Bayes classifiers, a family of classifiers based on Bayes' theorem, are known for producing simple yet effective models, particularly in document categorization. The Naive Bayes method performed best in this study as well.

An accuracy of 0.72 (output: 0.7204800995992686) was achieved using the multi-feature model. The model also produced relevant predictions for the genres of random summaries from outside the dataset. The model was stored in pickle files for further use. To provide a convenient interface, a Flask web application was developed. When the user enters a summary, an AJAX request is sent to the server, and the response is displayed by parsing the JSON data received. A preview of the webpage is shown in Figure 8.

Figure 8: Preview of flask web application

V. CONCLUSION

The authors worked on two datasets, CMU Book Summary and the Blurb Genre Collection. After performing various predictions on both datasets, it was observed that the authors were able to develop efficient, high-accuracy models. The models worked effectively not only on the datasets but also on random inputs from outside them. The various approaches used to solve the same problem, namely single-label classification, multi-label classification, and multi-feature classification, were all found to be accurate. During the study, it was noted that it is important to maintain the context of the words in the summary and to account for overlaps between genres; this was achieved using the multi-label model. Although book genre prediction is a complex classification task, the above models proved successful and accurate at predicting book genres.