CSEMP91
Submitted by
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
NOVEMBER 2024
C.V. RAMAN GLOBAL UNIVERSITY
BHUBANESWAR-ODISHA-752054
BONAFIDE CERTIFICATE
Certified that this project report Synthetic Data Set Generation Scheme
for Handwritten Content Extraction from Financial Document is a 7th
Semester bonafide work submitted by ADITYA AGRAWAL, Registration No.-
2101020005, ANISH KUMAR, Registration No.-2101020014, ANURAG
KUMAR, Registration No.-2101020023, and SAURAV KUMAR, Registration No.-
2101020038, CGU-Odisha, Bhubaneswar, who carried out the project under
my supervision.
C.V. RAMAN GLOBAL UNIVERSITY
BHUBANESWAR-ODISHA-752054
CERTIFICATE OF APPROVAL
This is to certify that I have examined the project entitled Synthetic Data Set
Generation Scheme for Handwritten Content Extraction from Financial
Document submitted by ADITYA AGRAWAL, Registration No.- 2101020005,
ANISH KUMAR, Registration No.-2101020014, ANURAG KUMAR,
Registration No.-2101020023, SAURAV KUMAR, Registration No.-2101020038,
CGU-Odisha, Bhubaneswar. I hereby accord my approval of it as a major project work
carried out and presented in a manner required for its acceptance towards completion
of major project stage-I (7th Semester) of the Bachelor Degree of Computer Science &
Engineering for which it has been submitted. This approval does not necessarily
endorse or accept every statement made, opinion expressed, or conclusion drawn as
recorded in this major project; it only signifies the acceptance of the major project for
the purpose for which it has been submitted.
SUPERVISOR
DECLARATION
We declare that this project report titled Synthetic Data Set Generation Scheme for
Handwritten Content Extraction from Financial Document, submitted in partial
fulfillment of the degree of B. Tech in Computer Science and Engineering, is a
record of original work carried out by us under the supervision of Dr. Prabhat
Dansena, and has not formed the basis for the award of any other degree or diploma
in this or any other Institution or University. In keeping with ethical practice in
reporting scientific information, due acknowledgements have been made wherever the
findings of others have been cited.
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our project guide Dr. Prabhat
Dansena, Assistant Professor, Computer Science and Engineering Department,
who has always been a source of motivation and firm support in carrying out this project.
We would also like to convey our sincerest gratitude and indebtedness to all other
faculty members and staff of the Department of Computer Science and Engineering, who
offered their effort and guidance at appropriate times, without which our project work
would have been very difficult.
An undertaking of this nature could never have been attempted without reference to
and inspiration from the works of others, whose details are mentioned in the references
section. We acknowledge our indebtedness to all of them. Further, we would like to
express our gratitude to our parents and God, who directly or indirectly encouraged
and motivated us throughout this work.
TABLE OF CONTENTS
DESCRIPTION PAGE NO
BONAFIDE CERTIFICATE ii
CERTIFICATE OF APPROVAL iii
DECLARATION iv
ACKNOWLEDGEMENTS v
ABSTRACT vii
LIST OF FIGURES viii
1. INTRODUCTION 01-02
1.1 Handwritten Content Extraction 01
1.2 Problem Statement 01
1.3 Objective 01
1.4 Solution 02
2. RESEARCH OVERVIEW 03
2.1 Literature Survey 03
3. METHODOLOGY 04
3.1 Dataset Preparation 04
3.1.1 Description of Dataset 05
3.1.2 Scanning and Preprocessing 06
3.1.3 Background Removal 06
3.2 Implementation of Model 07
3.2.1 Document Template Creation 07
3.2.2 Synthetic Dataset Creation 08
3.2.3 Text Drawing Function 09-10
3.3 Block Diagram 11
4. RESULTS 12
5. CONCLUSION 13
6. REFERENCES 14
ABSTRACT
The Synthetic Data Set Generation Scheme for Handwritten Content Extraction
from Financial Documents offers a novel approach to address the limitations of
real-data scarcity, privacy concerns, and variability in handwriting styles. Financial
documents, including invoices, receipts, and checks, often contain sensitive
information, limiting their availability for training machine learning models.
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
However, several challenges hinder effective handwritten content extraction. One significant
issue is the variability in handwriting styles [2]. Differences in cursive, print, or mixed writing
styles introduce inconsistencies that make it difficult for machine learning models to generalize
across datasets. Moreover, the diverse layouts and structures of financial documents add further
complexity to text recognition tasks, as the same type of information can appear in different
formats across documents.
1.3 Objective
The objectives of this project are to:
Develop Synthetic Data Generation Techniques: Create diverse and realistic synthetic
datasets by simulating a variety of handwriting styles, ink types, and paper textures
using generative models [1, 5].
Ensure Data Privacy: Address privacy concerns by generating synthetic data that
replicates the characteristics of real data, eliminating the need for sensitive, real-
world documents in the training process [7].
1.4 Solution
To address the challenges associated with handwritten content extraction, this
project proposes a synthetic data generation scheme as an innovative solution. The
core idea is to create realistic, scalable, and diverse datasets that replicate the
complexity and variability of real-world handwritten financial documents.
2. RESEARCH OVERVIEW
The literature survey encompasses various studies focusing on handwritten content extraction and
synthetic data generation. Prabhat Dansena, Soumen Bag, and Rajarshi Pal (2021) proposed a method
using CNN and K-means clustering to generate synthetic data for detecting handwritten word
alterations, emphasizing increased dataset diversity while acknowledging limited real-world variability
(see Table 1).
Rajesh Kumar, Nikhil R. Pal, Bhabatosh Chanda, and J. D. Sharma (2012) investigated forensic
detection of fraudulent alterations in ball-point pen strokes using k-NN, MLP, and SVM, highlighting
automation advantages but with a narrow focus on specific pen strokes (see Table 1).
Further, Partha Pratim Roy and collaborators introduced GAN-based synthetic data generation for Indic
handwriting recognition, expanding dataset inclusivity for underrepresented languages while grappling
with realism limitations. Lars Vögtlin et al. employed OCR-constrained GANs to generate synthetic
handwritten historical documents, achieving high realism but demanding significant computational
resources (see Table 1).
Sobhan Kanti Dhara and Debashis Sen advanced image super-resolution techniques for improving the
quality of handwritten document images, though these techniques faced challenges with highly distorted
inputs. Lastly, Vittorio Pippi and colleagues utilized visual archetypes for handwritten text generation,
producing realistic text overlays with certain limitations in style generalization. Together, these studies
underscore advancements in synthetic data generation and handwritten text extraction while
highlighting ongoing challenges in achieving robust generalization and realistic data synthesis (see
Table 1).
2.1 Literature Survey
3. METHODOLOGY
ii. Diversity: The handwritten dataset includes 50 repetitions of each capital letter (A-Z),
lowercase letter (a-z), digits (1-9), and various special characters. This repetition helps capture
different writing styles and variations, mimicking real-world scenarios.
Scanning Process: Each handwritten page containing the characters is scanned at high
resolution to ensure clear digitization of the images. This step preserves the finer details
of the characters and maintains consistency across the dataset.
Background Removal: The scanned images often include unwanted backgrounds, such
as paper texture or marks. Techniques like thresholding, background subtraction, or
masking are applied to remove these unwanted elements. This ensures the characters
are isolated against a clean background.
Noise Reduction: Noise, such as smudges or minor ink marks, can impact the accuracy
of machine learning models. Filters like Gaussian blur or median filtering are used to
reduce noise while preserving the integrity of the character's structure.
Foreground Enhancement: The primary focus of each image, which is the handwritten
character, is enhanced to improve contrast and visibility. This may include adjusting
the brightness, contrast, or applying edge enhancement techniques to make the
character stand out against the now-clean background.
Uniformity and Normalization: After background and noise removal, the images are
resized and normalized to consistent dimensions and pixel intensities. This step ensures
uniform input for machine learning models, making training and evaluation processes
more efficient.
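The steps above can be combined into a short preprocessing routine. The following is a minimal sketch using OpenCV; the file path, target size, and filter parameters are illustrative assumptions, not values fixed by this report.

```python
# Minimal preprocessing sketch for the steps above (file paths and
# parameter values are assumed; the report does not specify settings).
import cv2
import numpy as np

def preprocess_character(path, size=(64, 64)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Background removal: Otsu thresholding isolates the ink (foreground)
    # from paper texture and stray marks.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Noise reduction: median filtering removes smudges and specks
    # while preserving the character's stroke structure.
    denoised = cv2.medianBlur(binary, 3)

    # Uniformity and normalization: consistent dimensions and pixel
    # intensities in [0, 1] for model input.
    resized = cv2.resize(denoised, size, interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0
```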
3.1.2 Scanning and Preprocessing
i. Scanning the Documents:
Document Preparation: Before scanning, documents are cleaned (e.g., removing dust and
staples) to avoid interference and improve the quality of the scanned output.
Flatbed Scanners: Flatbed scanners are often used for such tasks because they provide
uniform lighting and minimize distortions, ensuring that the scanned image accurately
represents the original document.
ii. Converting to RGB Format:
What is RGB: RGB stands for Red, Green, and Blue. It is a color model in which these
three primary colors are combined in various ways to reproduce a broad spectrum of colors.
Converting scanned documents to RGB ensures that the digital version captures all color
details, including any highlights, stamps, or handwritten marks in various colors.
Conversion Process:
Color Detection: The scanner software or image processing tool identifies the color
spectrum in the scanned document.
Color Space Conversion: The image is converted from the scanner's native format (e.g.,
grayscale or CMYK) into RGB format using image processing libraries such as
OpenCV, PIL (Pillow), or built-in scanner software.
Ensuring Clarity and Fidelity: The RGB conversion is done in a way that enhances
contrast and clarity. Adjustments might be made to color balance and saturation to
ensure that the digital document accurately reflects the original's color scheme, making
details such as red or blue ink and highlighted sections visible.
Noise Reduction: The RGB-converted image may still have noise or artifacts from the
scanning process. Techniques like median filtering, Gaussian blur, or bilateral filtering
are used to reduce noise while preserving important edges and details [2].
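As a concrete illustration of the conversion and denoising described above, the following is a hedged sketch using Pillow and OpenCV; the filter parameters are assumptions rather than values taken from this report.

```python
# Sketch of RGB conversion and noise reduction (parameters are assumed).
import cv2
import numpy as np
from PIL import Image

def to_clean_rgb(path):
    # Pillow converts from the scanner's native mode (e.g., grayscale or
    # CMYK) into RGB so colored ink, stamps, and highlights are preserved.
    rgb = Image.open(path).convert("RGB")

    # Bilateral filtering reduces scanning noise while keeping edges sharp.
    bgr = cv2.cvtColor(np.array(rgb), cv2.COLOR_RGB2BGR)
    denoised = cv2.bilateralFilter(bgr, d=9, sigmaColor=75, sigmaSpace=75)
    return cv2.cvtColor(denoised, cv2.COLOR_BGR2RGB)
```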
iv. Uniformity and Data Integrity:
Consistency in Image Format: Converting all scanned documents into the RGB format
ensures consistency, making subsequent data processing more efficient. Uniform image
formats simplify pipeline processing in machine learning or OCR applications.
Maintaining Document Integrity: Special care is taken to ensure that the conversion and
preprocessing steps do not alter or degrade crucial elements of the document, such as
text legibility, signatures, or security features like watermarks.
v. Use Cases and Applications:
• Optical Character Recognition (OCR): RGB-converted documents are ideal for OCR
systems that extract text and data from documents. The color information helps the
OCR software differentiate between various types of content (e.g., printed vs.
handwritten text).
• Financial Analysis: Digital versions of financial documents can be used for automated
data extraction, record-keeping, and analysis. The high-quality, RGB-formatted images
ensure accurate data capture.
3.2 Implementation of Model
3.2.1 Document Template Creation
The templates are designed to accommodate diverse layouts and formats to ensure the
synthetic data reflects the variability seen in actual documents. For example, templates
are created with different positioning for text fields and varied column widths to mimic
the real-world diversity of financial documents. By using these pre-designed templates,
the system can generate a wide variety of synthetic financial documents that closely
resemble authentic handwritten content while maintaining consistent structure for
model training.
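A minimal sketch of how such a template might be assembled programmatically is shown below; the dimensions and field coordinates are hypothetical, since the report describes pre-designed templates without giving exact layouts.

```python
# Hypothetical template builder (coordinates and sizes are illustrative).
from PIL import Image, ImageDraw

def make_template(width=1200, height=500, fields=None):
    # A blank white page stands in for a cheque/receipt template.
    template = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(template)
    # Varying these positions across templates mimics real-world
    # diversity in document layouts and column widths.
    fields = fields or {"name": (80, 120), "amount": (80, 220),
                        "date": (80, 320)}
    for _, (x, y) in fields.items():
        # A ruled line marks where handwritten content will be placed.
        draw.line([(x, y + 40), (x + 500, y + 40)], fill="black", width=2)
    return template, fields
```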
3.2.2 Synthetic Dataset Generation
i. It generates a random first and last name using the Faker library, then combines them
into a full name with a space between them. The amount is a randomly generated
floating-point number between ₹10.00 and ₹1000.00, formatted as currency (with
commas and two decimal places). It also generates a random date using the fake.date()
function.
ii. Next, the code specifies the positioning of the text to be added to the template image
by defining three position tuples in the positions list. The shift amount adjusts the
horizontal position, while the template height determines the vertical placement, placing
text at different heights. The texts list holds the formatted text, including the name,
amount, and date.
iii. The draw_handwritten_text function is called in a loop to draw each text string at the
corresponding position on the template image. The char_images argument is a set of
predefined images for each character, used to simulate handwritten text. Once all the text
is added to the template, the image is saved to the output path. The function then returns
a dictionary containing the filename, name, amount, and date that were used to generate
the image [5, 7].
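Steps i-iii can be reconstructed roughly as follows. This is a hedged sketch: the positions, the shift_amount value, and the exact call signature of draw_handwritten_text (sketched in Section 3.2.3) follow the description above but are assumptions, not the report's verbatim code.

```python
# Approximate reconstruction of steps i-iii (names and positions assumed).
import random
from faker import Faker

fake = Faker("en_IN")

def generate_record(template, char_images, output_path, shift_amount=0):
    # i. Random name, currency-formatted amount, and date.
    name = f"{fake.first_name()} {fake.last_name()}"
    amount = f"{random.uniform(10.00, 1000.00):,.2f}"
    date = fake.date()

    # ii. Horizontal shift plus height-based vertical placement.
    h = template.height
    positions = [(100 + shift_amount, int(h * 0.25)),
                 (100 + shift_amount, int(h * 0.50)),
                 (100 + shift_amount, int(h * 0.75))]
    texts = [name, amount, date]

    # iii. Draw each string, save the image, and return the labels.
    for text, pos in zip(texts, positions):
        draw_handwritten_text(template, text, pos, char_images)
    template.save(output_path)
    return {"filename": output_path, "name": name,
            "amount": amount, "date": date}
```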
iv. After the data is generated (in Figure 1 and 2), it is organized into a structured format
such as a data frame, making it easier to manage and process.
The data is then exported to a .csv file for storage. This format is widely used for its
simplicity and compatibility with various software tools. Export functions in
programming libraries ensure that the data is stored cleanly, often with options like
excluding index columns for better readability.
This approach maintains data accessibility and usability, supporting analysis or further
processing in future projects. Adopting these practices results in well-organized and
reliable data storage.
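The export step might look like the following pandas snippet; the sample record and file name are illustrative placeholders.

```python
# Organize generated records into a DataFrame and export to CSV.
import pandas as pd

records = [
    {"filename": "doc_0001.png", "name": "Aditi Sharma",
     "amount": "845.20", "date": "2024-03-15"},
]  # in practice, one dict per call to the generation routine above

df = pd.DataFrame(records)
# Excluding the index column keeps the stored file more readable.
df.to_csv("synthetic_dataset_labels.csv", index=False)
```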
3.2.3 Text Drawing Function
The text drawing function (in Figure 3) serves as a tool to create realistic text overlays on
images, emulating human handwriting. This function is designed to handle various aspects of
text rendering to ensure that the resulting output closely resembles handwritten text. The main
components and their functionalities are outlined below:
i. Text Overlay Management: The function accounts for natural text spacing, proper
capitalization, and the inclusion of punctuation to create overlays that mimic real handwriting
styles. This involves an algorithmic approach to determine the spacing between characters and
words, ensuring that the text layout maintains visual coherence and legibility [1].
Character Spacing:
The function determines the optimal distance between characters to replicate the slight
variations seen in real handwriting. Instead of using a fixed value for spacing, it could
introduce randomness or adjust based on the shape and width of the characters to
simulate handwriting variability.
Kerning (adjusting the space between specific character pairs) may be incorporated to
ensure that certain combinations of characters fit together naturally (e.g., 'AV', 'To').
Word Spacing:
The function ensures that spaces between words aren't uniform but have subtle
differences to emulate how handwriting can vary. This slight inconsistency can be
controlled by a randomization parameter or algorithm that tweaks the space within set
limits.
Capitalization and Punctuation:
The function pays attention to the unique traits of capital letters and punctuation. For
instance, it may place punctuation marks (e.g., commas, periods) closer to the preceding
character and lower on the baseline, mimicking typical handwritten nuances.
Capital letters may be slightly larger or more pronounced to reflect how human
handwriting differs between uppercase and lowercase characters.
Baseline Alignment:
The function ensures that all characters align along an imaginary line, with adjustments
for characters like 'y', 'g', or 'p' that have descenders dipping below the baseline. The
alignment might also include slight vertical shifts to add a handwritten effect.
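The spacing and baseline behaviors listed above can be sketched as follows; char_images is assumed to be a dict mapping characters to PIL images, and all gap and jitter values are hypothetical parameters rather than the report's actual settings.

```python
# Hedged sketch of the spacing and baseline logic (parameters assumed).
import random

def draw_handwritten_text(template, text, position, char_images,
                          base_gap=4, word_gap=18, jitter=3):
    x, baseline = position
    descenders = set("gjpqy")
    for ch in text:
        if ch == " ":
            # Word spacing varies within limits rather than being uniform.
            x += word_gap + random.randint(-jitter, jitter)
            continue
        glyph = char_images.get(ch)
        if glyph is None:
            continue  # no scanned image for this character
        # Descenders dip below the baseline; a small vertical jitter
        # adds a handwritten feel to the remaining characters.
        dy = glyph.height // 3 if ch in descenders else 0
        y = baseline - glyph.height + dy + random.randint(-2, 2)
        mask = glyph if glyph.mode == "RGBA" else None
        template.paste(glyph, (x, y), mask)
        # Randomized character spacing mimics natural variability.
        x += glyph.width + base_gap + random.randint(-jitter, jitter)
```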
ii. Image Resizing Functionality: To ensure that characters fit properly within the designated
overlay space, the function includes an image resizing mechanism. This feature adjusts the size
of individual character images as needed, allowing for flexibility in text formatting and the
customization of character proportions.
Character Image Scaling:
The function can resize individual character images dynamically. For example, if the
text area has limited space, the function scales down the characters proportionally to
prevent them from overlapping or extending outside the designated area.
Scaling up might also be done to emphasize certain text, such as headings or
emphasized words, maintaining readability while keeping the appearance cohesive.
Aspect Ratio Preservation:
When resizing character images, the function ensures that the aspect ratio (the
relationship between width and height) is preserved. This prevents characters from
appearing distorted, which would detract from the handwritten appearance.
Adaptive Resizing for Visual Balance:
The function may adaptively resize characters based on the length of the text. For
instance, if a line of text is longer than the available width, the function scales down
each character slightly to fit within the bounds. Conversely, for shorter strings, the
function might scale up the characters or space them out more to maintain a balanced
look.
It can also consider the size of characters relative to others, such as making capital
letters and specific symbols proportionately larger than lowercase letters for a more
realistic handwriting simulation.
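A sketch of the aspect-ratio-preserving and adaptive resizing described above follows; the helper names and scaling behavior are assumptions based on this description, not the report's exact code.

```python
# Assumed resizing helpers: proportional scaling with preserved aspect ratio.
from PIL import Image

def fit_glyph(glyph, max_height):
    # Preserve the width/height ratio so the character is not distorted.
    scale = max_height / glyph.height
    return glyph.resize((max(1, int(glyph.width * scale)), max_height),
                        Image.LANCZOS)

def fit_line(glyphs, max_width, gap=4):
    # If the rendered line would overflow, scale every glyph down
    # proportionally so the text fits within the bounds.
    total = sum(g.width for g in glyphs) + gap * (len(glyphs) - 1)
    if total <= max_width:
        return glyphs
    scale = max_width / total
    return [g.resize((max(1, int(g.width * scale)),
                      max(1, int(g.height * scale))), Image.LANCZOS)
            for g in glyphs]
```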
3.3 Block Diagram
Block diagram illustration for synthetic data generation for the handwritten content
extraction from financial documents (in Figure 4).
4. RESULTS
Diversity: The synthetic dataset contained a variety of financial documents, including invoices,
receipts, and checks. Different document layouts (e.g., tables, headers, and line items) and
diverse handwriting styles were generated. These variations closely mimicked real-world
financial documents.
Handwriting Simulation: The handwriting generation engine produced realistic handwritten
content, simulating different writing styles (e.g., cursive, block letters) and noise factors (e.g.,
blurring, ink smudges).
Noise Introduction: Noise and distortions such as background artifacts, skewing, and irregular
ink flow were added to reflect real-world scanning or photocopying issues.
Dataset Generation Time: The synthetic dataset generation process was 80% faster compared
to manually collecting and labeling real-world documents. This significant reduction in time
and cost is one of the primary advantages of the proposed scheme.
Manual Labeling: Since synthetic data is automatically annotated, the need for manual labeling
was eliminated, further increasing efficiency.
5. CONCLUSION
Synthetic data generation plays a pivotal role in creating large, diverse datasets, thereby
enhancing the training of AI models, particularly for tasks like handwritten content extraction.
By combining synthetic and real data, models achieve better generalization and accuracy.
Synthetic data reduces the reliance on large labeled datasets, speeding up model development
and lowering associated costs. Furthermore, this approach ensures access to high-quality
datasets even in scenarios where real-world data is scarce, protecting privacy and ensuring
compliance with ethical standards.
Synthetic data generation has emerged as a transformative approach in AI development by
addressing critical challenges associated with traditional data collection and labeling. Its
ability to create large, diverse, and high-quality datasets has significantly improved the
training of machine learning models, particularly in areas like handwritten content extraction.
By reducing dependency on manually labeled data, synthetic data accelerates the development
process, cuts costs, and enables innovation even in data-scarce scenarios. Furthermore, the
integration of synthetic and real data enhances model performance by combining the variability
and scalability of synthetic data with the authenticity of real-world data. This hybrid approach
ensures models are robust, adaptive, and capable of performing accurately in diverse scenarios.
Synthetic data's role in preserving privacy is particularly noteworthy, as it minimizes the risks
associated with using sensitive real-world data, making it an essential tool for privacy-conscious
industries. The ability to generate datasets for underrepresented languages and scripts broadens
the scope of AI applications, enabling the development of inclusive technologies that cater to
global populations. Additionally, the use of synthetic data in domains like fraud detection
demonstrates its potential to tackle complex, high-stakes problems across industries.
In conclusion, synthetic data is not merely a complement to traditional data but a game-changing
solution that reshapes how AI systems are trained, deployed, and scaled. It empowers
organizations to build more accurate, diverse, and privacy-preserving AI solutions, addressing
challenges that were previously difficult or expensive to solve [1, 5, 7].
6. REFERENCES
[1] P. Dansena, S. Bag, and R. Pal, "Generation of Synthetic Data for Handwritten
Word Alteration Detection," IEEE, 2021.
[2] R. Kumar, N. R. Pal, B. Chanda, and J. D. Sharma, "Forensic detection of fraudulent
alteration in ball-point pen strokes," IEEE, 2012.
[3] J. Martínek, L. Lenc, and P. Král, "Building an efficient OCR system for historical
documents with little training data," Springer, 2020.
[4] P. Dansena, R. Pal, and S. Bag, "Quantitative assessment of capabilities of colour
models for pen ink discrimination in handwritten documents," IET, 2020.
[5] P. P. Roy, A. Mohta, and B. B. Chaudhuri, "Synthetic data generation for Indic
handwritten text recognition," Journal of Pattern Recognition, 2018.
[6] L. Vögtlin, M. Drazyk, V. Pondenkandath, M. Alberti, and R. Ingold, "Generating
synthetic handwritten historical documents with OCR constrained GANs,"
Springer, 2021.
[7] S. K. Dhara and D. Sen, "Across-scale process similarity-based interpolation for
image super-resolution," Elsevier, 2020.
[8] V. Pippi, S. Cascianelli, and R. Cucchiara, "Handwritten text generation from visual
archetypes supplementary material," Conference Proceedings, 2023.
PLAGIARISM REPORT