Document Image Classification using Deep Learning

Vigneshkumar M, Associate Professor, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, [email protected]

Dr. K. S. Kannan, Associate Professor, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, [email protected]

Harsha Vardhan Nalleboina, 99220042099, Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, [email protected]

Tharun Narra, 99220042096, Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, [email protected]

Devi Prasad Ponnapula, 99220042100, Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, [email protected]

Rajendra Reddy Bijjam, 99220042130, Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, [email protected]

Abstract: This project develops a CNN-based document classifier, a deep learning approach to classifying scanned TIFF document images into sixteen types, including invoices, resumes, letters, and scientific papers. The dataset is cleaned by removing non-TIFF files and corrupted images so that only good-quality data is used for training. The image preprocessing pipeline resizes all images to a standard size (224x224 pixels) and rescales the pixel values to improve model performance. The model is built with TensorFlow and Keras and uses several convolutional and pooling layers for feature extraction, followed by dense layers that classify the document. An 80-20 train-validation split was used, with categorical cross-entropy as the loss function and the Adam optimizer.

Model performance is monitored over ten epochs, and plots of training and validation accuracy are used to judge it. Finally, a Gradio-based user-friendly interface lets a user upload a TIFF document image and returns a prediction along with a confidence score. The interface exposes the trained model for real-time classification in a web format, putting it within reach of practical applications. The project demonstrates a systematic and simple approach to automated document classification, with potential applications in digital archiving and document management, which require fast and accurate document classification.

Keywords: Deep learning, Document classification, Convolutional neural networks, Image preprocessing, TIFF images, TensorFlow, Keras, Gradio interface, Model evaluation, Image classification

I. INTRODUCTION

One research paper proposes a document classification framework based on deep transfer learning and feature reduction techniques. The study uses pre-trained models (e.g., DenseNet121, VGG19) together with classifiers such as logistic regression and k-nearest neighbors. PCA and LDA feature reduction are used to find the best trade-off between accuracy and efficiency, leading to results such as 97.83% accuracy for DenseNet121-LDA-LR on 546 images. The work recommends the approach for inclusion in ERP system environments, where it can support document processing tasks such as OCR. [1]

A multimodal deep learning model for classifying digitized documents uses CNNs for image features and RNNs for text features. Because these hybrid models do not fall below the accuracy of single-modality models, the approach shows promise for financial, healthcare, and legal applications. It reaches an accuracy of 94.84% on a test set of 9,125 documents in seven categories. [2]

Another paper outlines a framework for understanding unstructured financial documents that combines robotic process automation (RPA) with a multimodal approach. RPA is used alongside a pre-trained deep learning model for classification and key information extraction (KIE) of multilingual documents. Document understanding models such as LayoutXLM are applied to the tasks they are most relevant to, here in banking. The framework was shown to significantly improve the accuracy of KIE, especially under key-value labeling, and to reduce business processing time by more than 30% in some applications. [3]
Document classification is a long-standing problem in computer science. The paper "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization" offers a multimodal solution that trains deep learning architectures on digitized documents and fuses information from the text and image modalities to improve accuracy. Textual information is classified by an RNN, while images are processed by a CNN. The two results are combined in a fusion layer for classification. The experimental results indicate that this multimodal approach significantly outperformed single-modality methods on the documents considered, which come from industries such as finance, healthcare, and law. The model could be applied in automated document management systems, where it promises to improve both the efficiency and the accuracy of document classification. [4]

The paper "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction" presents a framework for classifying document images using deep transfer learning and feature reduction. Combining pre-trained deep learning models such as DenseNet121 and VGG19 with machine learning classifiers gave up to 97.83% accuracy on the small dataset of scanned documents used in the study. The framework includes dimensionality reduction techniques such as PCA and LDA to increase processing speed without degrading performance. It has potential applications in enterprise resource planning systems, especially for document management and preprocessing in OCR systems. [5]

The paper "A Document Image Classification System Fusing Deep and Machine Learning Models" studies an array of document classification methods, with a particular focus on digitization in university document management systems. The techniques are based on OCR, deep learning, and ensemble methods; a fusion model of EfficientNetB3 and ExtraTree classifiers achieved an F-score of 94.45%. The analysis shows that combining content-based features with image-based features contributes greatly to classification accuracy. The system addresses the needs of document management and reduces manual work in document verification. [6]

ALGORITHMS USED:

This work implements a Convolutional Neural Network (CNN), a deep learning architecture that is particularly effective for image classification because it captures hierarchies of spatial information within images. The CNN model is implemented with TensorFlow and Keras. Several kinds of layers are involved: convolutional, pooling, flattening, and dense layers.

In the initial convolutional layers, the model applies filters to the input image to extract low-level features such as edges and textures. The deeper layers capture the more complex patterns needed to distinguish between document types such as resumes, invoices, and scientific reports. Each convolutional layer is immediately followed by a pooling layer, which reduces the spatial dimensions of the feature maps. This lowers the computational requirements and also helps limit overfitting. After three convolution-pooling blocks, the features are flattened into a single vector and passed to dense layers, ending in a softmax output layer with 16 nodes, one for each document class.

Categorical cross-entropy is used as the loss function because this is a multi-class classification problem, and the Adam optimizer is used because it adapts the learning rates during training. Training consists of feeding batches of images through the network and updating the weights to minimize the error between the predicted classes and the true classes. The validation data is used to monitor how well the model generalizes. Finally, a Gradio interface uses the trained CNN to let users upload images and get a prediction, serving as a simple practical application of document classification.

II. LITERATURE SURVEY

One paper addresses document image classification using deep convolutional neural networks (DCNNs) with intra-domain transfer learning and stacked generalization. It uses VGG16-based DCNNs that are pre-trained on ImageNet and fine-tuned on document images, and it divides each document into regions, with a separate model trained for each region. By stacking the predictions of these region-specific models into a combined model, an accuracy of 92.21% was achieved on the RVL-CDIP dataset, which the paper reports as a new benchmark result. The approach improves training efficiency through inter-domain and intra-domain transfer learning. [7]

Another paper presents an extensive study of the feasibility of deep CNNs for document image classification and retrieval. It compares CNN-based features with the traditional handcrafted features used for these tasks and finds the CNN features superior. The experiments show that CNNs pre-trained on non-document images transfer well to document classification, and that with enough training data region-specific CNN models are no longer necessary. The best accuracy on large document datasets was achieved by holistic CNN approaches. [8]

A further article introduces a multimodal neural network model for document classification integrating text and
image modalities. It exploits both visual features extracted from images via MobileNetV2 and textual content processed through Tesseract OCR and FastText embeddings. On the RVL-CDIP and Tobacco3482 datasets, the model improves classification accuracy by about 3% over single-modality baselines. The paper shows that integrating text and image modalities enables finer-grained document classification, even when the text produced by OCR is imperfect. It concludes with a discussion of the practical applications of multimodal learning in document processing. [9]

In this work, the authors explore Convolutional Neural Networks (CNNs) for document image classification, adapting the CNN architecture and data augmentation methods to document-specific features rather than general image datasets. The findings show that performance on the RVL-CDIP dataset improves with shear transformations and larger input images. The authors also study CNN design parameters such as depth, width, and input size, and illustrate how CNNs learn region-wise layout features for document classification. The paper achieves state-of-the-art results on RVL-CDIP by tuning the CNN architecture and data preprocessing. [10]

III. EXISTING SYSTEM

1. AlexNet
Description: AlexNet brought CNNs to prominence when it won the ImageNet 2012 challenge. It consists of five convolutional layers and three fully connected layers, with progressively deeper layers performing feature extraction.
Application: Commonly used as a baseline for document classification, since it handles image data reasonably well.
Limitations: The input size is fixed at 227x227, and it tends to perform poorly on documents with highly complex layouts.

2. VGGNet
Description: Networks 16 to 19 layers deep that rely mainly on small (3x3) convolutions, which are effective at capturing fine-grained features.
Application: Mostly used for processing large scans of detailed documents, where small nuances such as font details or layout need to be captured.
Limitations: The deep architecture is computationally expensive, and it lacks later innovations such as residual learning that ease training instability.

3. ResNet (Residual Networks)
Description: ResNet uses very deep architectures, such as ResNet-50 and ResNet-152, to capture features of increasing complexity and abstraction, and it simplifies their training with residual connections.
Application: Well suited to mixed, heterogeneous document datasets, because its deep, feature-rich layers let the network represent more complex document structures.
Limitations: Computational demands are high, and very large datasets may be needed to avoid overfitting.

4. Inception Networks (GoogLeNet, InceptionV3)
Description: Uses "inception modules" that run parallel convolutions of different sizes, including 1x1, 3x3, and 5x5, to capture features at multiple scales.
Application: Helps with document images that have diverse layouts and contain elements such as varying font sizes and mixed content.
Limitations: The architecture is highly complex, which makes it hard to engineer and fine-tune.

5. EfficientNet
Description: EfficientNet balances accuracy against computational cost by systematically scaling the depth, width, and resolution of the network.
Application: Appropriate for document imaging tasks that need high accuracy with efficient use of resources.
Limitations: EfficientNet requires careful scaling to reach its peak performance, so it may demand extra customization for document-specific tasks.

IV. PROPOSED SYSTEM

The TIFF document image classification system categorizes images into 16 classes covering a broad range of document types, such as invoices, resumes, and letters. The complete workflow, comprising data cleaning and preprocessing modules, model training, and Gradio-based deployment, allows classification to be carried out accurately with minimal user intervention. The modules of the system are as follows:

1. Data Cleaning Module:
The TIFF-only dataset is cleaned as follows:
I. elimination of all non-TIFF files;
II. integrity check of all TIFF files, ensuring that every document is readable;
III. removal of all corrupted files, that is, files that could not be opened, so that the data is validated for quality training.
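
The three cleaning rules of the data cleaning module can be sketched in Python as follows. This is a minimal, standard-library-only sketch under stated assumptions, not the project's actual code: the names `is_readable_tiff` and `clean_dataset` are hypothetical, and readability is approximated by checking the 4-byte TIFF header, whereas a fuller integrity check would decode each image with an imaging library.

```python
from pathlib import Path

# Classic TIFF files start with a byte-order mark ("II" or "MM")
# followed by the number 42 in that byte order.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def is_readable_tiff(path):
    """Return True if the file opens and starts with a TIFF header.

    Assumption: a header check stands in for a full decode of the image.
    """
    try:
        with open(path, "rb") as handle:
            return handle.read(4) in TIFF_MAGIC
    except OSError:
        return False  # unreadable files count as corrupted


def clean_dataset(root, delete=False):
    """Split files under `root` into kept TIFFs and rejected files.

    Rule I:   files without a .tif/.tiff extension are rejected.
    Rule II:  remaining files must pass the readability check.
    Rule III: rejected files are removed when `delete` is True.
    """
    kept, rejected = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        has_tiff_name = path.suffix.lower() in {".tif", ".tiff"}
        if has_tiff_name and is_readable_tiff(path):
            kept.append(path)
        else:
            rejected.append(path)
            if delete:
                path.unlink()
    return kept, rejected
```

Calling `clean_dataset("dataset_dir")` with the default `delete=False` only reports what would be removed, which is a cautious default for a first pass over a new dataset; `dataset_dir` is a placeholder path.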
2. Data Preprocessing Module:
I. resizing all images to 224x224 pixels, a standard input size for CNNs, and normalizing the pixel values to the range 0 to 1;
II. splitting the data with an 80%-20% train-test split, which supports evaluation of the model.

3. Deep Learning Model:
The convolutional neural network (CNN):
consists of three convolutional layers, which extract features from the documents through consecutive layers;
ends with a flattening layer that connects these features to a classifier with 16 class outputs.

4. Training and Evaluation Module:
The model was trained with categorical cross-entropy loss and the Adam optimizer, which balances learning speed against final model performance.

5. Gradio Prediction Interface:
Provides an easily accessible web interface that allows users to upload TIFF images for classification.
It returns the predicted document type together with a confidence score, almost in real time.

Fig. 1.1. Flowchart

V. METHODOLOGY

The paper describes a CNN-based document classification scheme that works on a dataset of TIFF images covering sixteen types of documents, including invoices, resumes, and letters. The dataset is first prepared using several techniques, after which the model is built and connected to a user-friendly interface.

1. Data Preprocessing: The first step filters out all non-TIFF files and corrupted images to obtain clean data. The second step resizes all images to 224x224 pixels and normalizes their pixel values. The last step splits the data 80-20 into training and validation sets, on which the model is assessed.

2. Model Architecture: The CNN consists of several layers. Convolutional layers extract features such as edges and layouts from the document images; the pooling layers then reduce the spatial dimensions to mitigate overfitting and cut the computational load. Finally, fully connected dense layers take the feature vectors from the previous layers and feed a sixteen-way softmax classifier.

3. Training and Validation: Categorical cross-entropy loss is used with the Adam optimizer, whose adaptive learning rates improve convergence. Training and validation accuracy were monitored over ten epochs to evaluate the model.

4. Deployment: Gradio provides direct access to the model, so uploaded images are classified in real time along with confidence scores. This enables immediate classification and brings the model closer to the requirements of real applications.

The result is a viable document classification system, with high accuracy and ease of use, that can be turned into applications for document management and archiving.

VI. RESULT

The experimental results show that the CNN-based document classification model categorizes TIFF document images into one of sixteen categories. High accuracy is achieved on both the training and validation datasets. The model was evaluated over ten epochs of training and showed steady improvement in accuracy and loss. The validation accuracy closely tracks the training accuracy. This suggests that the model generalizes well to unseen data and does not easily overfit, because the convolutional and pooling layers extract meaningful features from the document images.

After training, the model was tested through a Gradio interface, which allowed users to upload TIFF images and receive classifications with confidence scores. It worked correctly, processing each document image quickly in real time. Its strength was in correctly distinguishing visually distinct document types, such as resumes and scientific reports, but it sometimes failed on documents with similar layouts or minimal distinguishing features.
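
How a confidence score relates to the classifier's output can be illustrated with a small pure-Python sketch. It is an illustration under stated assumptions, not the project's code: the helper names, the four class names (a hypothetical subset of the sixteen classes), and the raw scores are all made up. It shows a softmax turning raw scores into probabilities, and how a small margin between the top two classes can flag documents whose layouts look alike.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_predictions(probabilities, class_names, k=2):
    """Return the k most likely (class_name, confidence) pairs, best first."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical subset of the sixteen classes, for illustration only.
CLASS_NAMES = ["invoice", "resume", "letter", "scientific paper"]

scores = [2.0, 0.5, 1.8, -1.0]  # made-up raw scores for one image
probs = softmax(scores)
best, runner_up = top_predictions(probs, CLASS_NAMES)

# A small gap between the two best classes suggests the model finds
# them hard to tell apart for this image.
margin = best[1] - runner_up[1]
```

Here `best` carries the predicted label and its confidence, while `margin` gives a rough signal for when a prediction deserves a manual check.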
The system is also suitable for digital archiving and automated document management applications where large volumes of documents need to be classified efficiently. Although the model could be improved further with more training data or fine-tuning for accuracy on challenging classes, it is currently an effective and reliable tool for document classification. The Gradio integration also lets non-technical users access the model's functionality through a simple interface.

VII. CONCLUSION

In conclusion, the project has implemented an accurate document classification model that uses convolutional neural networks to classify TIFF documents into sixteen types, such as invoices, CVs, and scientific papers. The approach requires no text extraction and owes its accuracy to a combination of careful data preparation, model design, and user-centered deployment. Because the CNN works directly on the image and does not depend on text extraction, it places no particular requirements on the input and copes with variation in document quality and layout while remaining simple to use. The CNN generalizes across various document formats and types, which makes it helpful in applications that require document classification and accurate digital archiving. In addition, the Gradio interface gives end users real-time predictions and confidence scores directly from uploaded images, which makes the system usable by people with no programming or computer science background and serves as a proof of concept that large-scale document classification can be automated. The project thus shows the promise of CNNs for large-scale document classification and document intake pipelines. Future work could modify the model or the granularity of the classes, with a focus on raising accuracy for document types that closely resemble one another, and could extend the system to a wider range of documents and to high-volume document-processing work.

VIII. REFERENCES

[1] A. Jadli, M. Hain, and A. Hasbaoui, "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction," International Journal of Advanced Trends in Computer Science and Engineering, vol. 10, no. 2, pp. 549-557, Mar.-Apr. 2021, doi: 10.30534/ijatcse/2021/141022021.

[2] R. Abkrakhmanov, A. Elubaeva, T. Turymbetov, V. Nakhipova, S. Turmaganbetova, and Z. Ikram, "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization," International Journal of Advanced Computer Science and Applications, vol. 14, no. 7, pp. 720-728, 2023, doi: 10.14569/IJACSA.2023.0140779.

[3] S. Cho, J. Moon, J. Bae, J. Kang, and S. Lee, "A Framework for Understanding Unstructured Financial Documents Using RPA and Multimodal Approach," Electronics, vol. 12, no. 4, p. 939, 2023, doi: 10.3390/electronics12040939.

[4] R. Abkrakhmanov, A. Elubaeva, T. Turymbetov, V. Nakhipova, S. Turmaganbetova, and Z. Ikram, "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization," International Journal of Advanced Computer Science and Applications, vol. 14, no. 7, pp. 720-726, 2023.

[5] A. Jadli, M. Hain, and A. Hasbaoui, "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction," Int. J. Adv. Trends Comput. Sci. Eng., vol. 10, no. 2, pp. 549-557, 2021.

[6] S. I. Omurca, E. Ekinci, S. Sevim, E. B. Edinç, S. Eken, and A. Sayar, "A Document Image Classification System Fusing Deep and Machine Learning Models," Applied Intelligence, vol. 53, pp. 15295-15310, 2023.

[7] A. Das, S. Roy, U. Bhattacharya, and S. K. Parui, "Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks," 24th Int. Conf. on Pattern Recognition (ICPR), Beijing, China, 2018.

[8] A. W. Harley, A. Ufkes, and K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," arXiv preprint arXiv:1502.07058, 2015.

[9] N. Audebert, C. Herold, K. Slimani, and C. Vidal, "Multimodal Deep Networks for Text and Image-Based Document Classification," arXiv preprint arXiv:1907.06370, 2019.

[10] C. Tensmeyer and T. Martinez, "Analysis of Convolutional Neural Networks for Document Image Classification," arXiv preprint arXiv:1708.03273, 2017.
