Document Image Classification using Deep Learning
I. INTRODUCTION

Document classification remains an open problem in computer science. The paper "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization" proposes a multimodal solution: CNN-based deep learning architectures are trained on digitized documents, and information from the text and image modalities is fused to improve accuracy. Textual information is classified by an RNN, while images are processed by a CNN; the two outputs are combined in a fusion layer for the final classification. The experimental results indicate that this multimodal approach significantly outperformed single-modality methods, with higher accuracy on the documents considered, which come from industries such as finance, healthcare, and law. The model could be applied in automated document management systems, where it promises to improve both the efficiency and the accuracy of document classification. [4]

The paper "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction" presents a framework for classifying document images using deep transfer learning and feature reduction. Combining pre-trained deep learning models such as DenseNet121 and VGG19 with machine learning classifiers yielded up to 97.83% accuracy on the small dataset of scanned documents used in the study. The framework includes dimensionality reduction techniques such as PCA and LDA to increase processing speed without degrading performance. It has potential applications in enterprise resource planning systems, especially for document management and preprocessing in OCR systems. [5]

The research paper "A Document Image Classification System Fusing Deep and Machine Learning Models" studied a range of methods for classifying documents, with particular attention to digitization in university document management systems. The techniques considered are based on OCR, deep learning, and ensemble methods; a fusion model of EfficientNetB3 and ExtraTree classifiers achieved a 94.45% F-score. The analysis showed that combining content-based features with image-based features contributes greatly to classification accuracy. The system is meant to address the needs of document management and to reduce manual work in document verification. [6]

ALGORITHMS USED:

This work implements a Convolutional Neural Network (CNN), a deep learning architecture that is particularly effective for image classification because it captures hierarchies of spatial information in images. The CNN model was implemented with TensorFlow and Keras. Several kinds of layers are involved: convolutional, pooling, flattening, and dense layers.

In the initial convolutional layers, the model applies filters to the input image to extract low-level features such as edges and textures. The deeper layers capture the more complex patterns required to distinguish between document types such as resumes, invoices, and scientific reports. Each convolutional layer is immediately followed by a pooling layer, which reduces the spatial dimensions of the feature maps. This decreases the computational requirements and also helps to limit overfitting. After three convolution-pooling blocks, the features are flattened into a single vector and passed to dense layers, ending in a softmax output layer with 16 nodes, each representing one of the document classes under consideration.

Categorical cross-entropy is used as the loss function because this is a multi-class classification problem, together with the Adam optimizer, which adapts the learning rates during training. Training consists of feeding batches of images through the network and updating the weights to minimize the error between the predicted classes and the true classes. Validation data is used to monitor the generalization accuracy of the model. Finally, a Gradio interface uses the trained CNN to let users upload images and obtain predictions, serving as a simple practical application of document classification.

II. LITERATURE SURVEY

The paper addresses document image classification using deep convolutional neural networks (DCNNs) with intra-domain transfer learning and stacked generalization. It uses VGG16-based DCNNs that are transferred from ImageNet and fine-tuned on document images, and it divides each document into regions, with a separate model trained for each region. By stacking the predictions of these region-specific models into a hybrid model, an accuracy of 92.21% was achieved on the RVL-CDIP dataset, which is reported as a new benchmark result. The approach improves training efficiency through inter-domain and intra-domain transfer learning. [7]

This work presents an extensive study of the feasibility of using deep CNNs for document image classification and retrieval. It compares CNN-based features with the traditional handcrafted features used for these purposes and establishes the superiority of the former. The experiments show that CNNs pre-trained on non-document images transfer well to document classification tasks and that, given enough training data, region-specific CNN models are no longer necessary. The best accuracy on large document datasets was achieved with holistic CNN approaches. [8]

The article introduces a multimodal neural network model for document classification that integrates text and
image modalities. It exploits both visual features extracted from images via MobileNetV2 and textual content processed through Tesseract OCR and FastText embeddings. On the RVL-CDIP and Tobacco3482 datasets, the model scores a 3% higher classification accuracy than single-modality baselines. It is shown that integrating the text and image modalities enables finer-grained document classification, even when the text produced by OCR is imperfect. The paper concludes with a discussion of the practical applications of multimodal learning in document processing. [9]

In this work, the authors explore Convolutional Neural Networks (CNNs) for document image classification, adapting the CNN architecture and the data augmentation methods to document-specific features rather than to general image datasets. The findings show that performance on the RVL-CDIP dataset improves with shear transformations and larger input images. The authors also study CNN design parameters such as depth, width, and input size, and illustrate how CNNs can learn region-wise layout features for document classification. State-of-the-art results on RVL-CDIP are achieved by tuning the CNN architecture and the data preprocessing. [10]

III. EXISTING SYSTEM

1. AlexNet
Description: AlexNet brought CNNs to prominence when it won the ImageNet 2012 challenge. It consists of five convolutional layers and three fully connected layers, with progressively deeper layers performing feature extraction.
Application: Commonly used as a baseline for document classification, since it handles image data well.
Limitations: The input size is fixed at 227x227, so it tends to perform poorly on documents with highly complex layouts.

2. VGGNet
Description: Networks 16 to 19 layers deep that rely mainly on small (3x3) convolutions stacked in depth. They are quite effective at capturing fine-grained features.
Application: Mostly used for processing large scans of detailed documents, where small nuances such as font details or layout need to be captured.
Limitations: The deep architecture is computationally expensive, and it lacks innovations such as residual learning that would ease training instability.

3. ResNet (Residual Networks)
Description: ResNet makes very deep architectures such as ResNet-50 and ResNet-152 practical, allowing them to capture features of increasing complexity and abstraction, and it simplifies their training through residual connections.
Application: Well suited to heterogeneous document datasets, because its deep, feature-rich layers allow the network to represent more complex document structures.
Limitations: Computational demands are high, and very large datasets may be needed to avoid overfitting.

4. Inception Networks (GoogLeNet, InceptionV3)
Description: Uses "inception modules" that run parallel convolutions of different sizes (1x1, 3x3, and 5x5) to capture features at multiple scales.
Application: Helps with document images that have diverse layouts and contain elements such as varying font sizes and mixed content.
Limitations: The architecture is complex, which makes it hard to engineer and to fine-tune appropriately.

5. EfficientNet
Description: EfficientNet balances accuracy against computational cost by systematically scaling the depth, width, and resolution of a network.
Application: Appropriate for document imaging tasks in which high accuracy must be achieved with efficient use of resources.
Limitations: EfficientNet requires very careful scaling to reach its peak performance, so it may demand extra customization for document-specific tasks. As a result, its configuration details need close attention.

IV. PROPOSED SYSTEM

The TIFF document image classification system is developed to categorize images into 16 classes covering a broad range of document types, such as invoices, resumes, and letters. The complete workflow, comprising data cleaning and preprocessing modules, model training, and Gradio-based deployment, allows classification to be carried out accurately with minimal user intervention. An overview of the system's modules follows:

1. Data Cleaning Module:
The rules for cleaning the TIFF-only dataset are:
I. elimination of all non-TIFF files;
II. an integrity check of all TIFF files, ensuring that every document is readable;
III. removal of all corrupted files, that is, files that could not be opened, so that only validated images are used for training.
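
The cleaning rules above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' implementation: the names `looks_like_tiff` and `clean_dataset` are hypothetical, and the integrity check here only inspects the 4-byte header of a classic TIFF file (byte-order mark plus the magic number 42). A real pipeline would more likely decode each file fully with an image library.

```python
from __future__ import annotations

from pathlib import Path

# Classic TIFF files start with a byte-order mark ("II" little-endian or
# "MM" big-endian) followed by the magic number 42 in that byte order.
TIFF_HEADERS = (b"II*\x00", b"MM\x00*")
TIFF_SUFFIXES = {".tif", ".tiff"}


def looks_like_tiff(path: Path) -> bool:
    """Return True if the file has a TIFF extension and a valid TIFF header."""
    if path.suffix.lower() not in TIFF_SUFFIXES:
        return False  # rule I: not a TIFF file by name
    try:
        with path.open("rb") as handle:
            return handle.read(4) in TIFF_HEADERS  # rule II: header check
    except OSError:
        return False  # unreadable files count as corrupted


def clean_dataset(root: Path, delete: bool = False) -> tuple[list[Path], list[Path]]:
    """Split the files under root into (kept, rejected); optionally delete rejects."""
    kept: list[Path] = []
    rejected: list[Path] = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        (kept if looks_like_tiff(path) else rejected).append(path)
    if delete:
        for path in rejected:  # rule III: remove files that failed the checks
            path.unlink()
    return kept, rejected
```

Running `clean_dataset(Path("data"), delete=False)` first (the `data` directory is a placeholder) lists what would be removed without touching the files, which is a cautious way to review the rejects before deleting anything.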
2. Data Preprocessing Module:
I. resizing all images to 224x224 pixels, a standard input size for CNNs, and normalizing the pixel values to the range 0 to 1;
II. splitting the data into training and test sets at 80%-20% to support evaluation.

3. Deep Learning Model:
The convolutional neural network (CNN) consists of three convolutional layers, which extract features from the documents through consecutive layers, followed by a final flattening layer that connects them to a classification layer with 16 class outputs.

4. Training and Evaluation Module:
The model was trained with categorical cross-entropy loss and the Adam optimizer, which balances learning speed against final model performance.

5. Gradio Prediction Interface:
This provides an easily accessible web interface that allows users to upload TIFF images for classification. It returns the predicted document type together with a confidence score in near real time.

... letters. Once the dataset preparation, which involves several techniques, was finalized, the work moved on to building the model and exposing it through a user-friendly interface.

1. Data Preprocessing: The first step filters out all non-TIFF files and corrupted images to obtain clean data for the task. The second step resizes all images to 224x224 pixels and normalizes their pixel values. The last step splits the data 80-20 into training and validation sets, on which the ability of the model is assessed.

2. Model Architecture: The CNN consists of several layers. Convolutional layers extract features such as edges and layouts from the document images; pooling layers then reduce the spatial dimensions, which mitigates overfitting and cuts the computational load. Finally, fully connected dense layers map the feature vectors obtained from the previous layers to a softmax output over the sixteen classes.

3. Training and Validation: Categorical cross-entropy is used as the loss with Adam as the optimizer, whose adaptive learning rate improves convergence. Training and validation accuracy were monitored over ten epochs to assess how well the model generalizes.
4. Deployment: Gradio provides direct access to the model, so uploaded images are classified in real time and returned with confidence scores. This enables immediate classification and brings the model closer to the requirements of its real application.
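
The pixel scaling and the 80-20 split from the preprocessing step can be sketched in plain Python. This is only an illustration under assumptions that the text does not state: pixels are taken to be 8-bit intensities (0 to 255), the split is a seeded random shuffle, and the names `scale_pixels` and `train_validation_split` are hypothetical. Resizing to 224x224 is left out because it needs an image library.

```python
import random


def scale_pixels(pixels):
    """Map 8-bit intensities (0-255) to floats in the range 0 to 1."""
    return [value / 255.0 for value in pixels]


def train_validation_split(items, train_fraction=0.8, seed=42):
    """Shuffle items reproducibly and split them into (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Example: 100 file indices become 80 training and 20 validation items.
train_items, validation_items = train_validation_split(range(100))
```

Fixing the seed keeps the validation set identical across runs, so accuracy figures from different training runs stay comparable.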
Hence, this is a practical document classification system, accurate and easy to use, that can be adapted to document management and archiving applications.
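
As a worked check of the architecture and loss described in this section, the following sketch does the arithmetic in plain Python. It is an illustration under stated assumptions rather than the authors' code: the text gives only the 224x224 input, three convolution-pooling blocks, and a 16-way softmax, so the unpadded 3x3 kernels, the 2x2 pooling, and the 128 filters in the last block are assumed here, and all function names are hypothetical.

```python
import math


def conv_pool_size(size, kernel=3, pool=2):
    """Side length after one unpadded, stride-1 convolution and one pooling step."""
    after_conv = size - kernel + 1
    return after_conv // pool  # non-overlapping pooling rounds down


def flattened_length(input_size=224, blocks=3, last_filters=128):
    """Length of the vector produced by flattening the last feature maps."""
    size = input_size
    for _ in range(blocks):
        size = conv_pool_size(size)
    # 224 -> 111 -> 54 -> 26, so 26 * 26 * 128 = 86528 values
    return size * size * last_filters


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for a single sample with a one-hot encoded true class."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot, probs))


# A model that scores all 16 classes equally assigns each class 1/16.
uniform_probs = softmax([0.0] * 16)
true_class = [1.0] + [0.0] * 15
uniform_loss = categorical_cross_entropy(true_class, uniform_probs)
```

For such an untrained, uniform prediction the loss equals ln(16), about 2.77, which gives a reference point when reading the loss in the first training epochs.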
VI. RESULT