Handwritten Text Recognition: Software Requirements Specification
By
Guided By
1. Abstract
This project seeks to classify individual handwritten words so that handwritten text can be
translated into digital form. We use two main approaches to accomplish this task: classifying
words directly and character segmentation. For the former, we use Convolutional Neural
Networks (CNNs) with various architectures to train a model that can accurately classify words.
For the latter, we use Long Short-Term Memory (LSTM) networks with convolution to
construct bounding boxes for each character. We then pass the segmented characters to a
CNN for classification and reconstruct each word from the results of
classification and segmentation.
2. Introduction
Despite the abundance of technological writing tools, many people still choose to take their
notes traditionally: with pen and paper. However, handwritten text has drawbacks.
Physical documents are difficult to store, search, and share efficiently. As a result, a lot of
important knowledge is lost or never reviewed because the documents are never transferred
to digital format. We decided to tackle this problem because we believe the
significantly greater ease of managing digital text compared to written text will help
people access, search, share, and analyse their records more effectively, while still allowing
them to use their preferred writing method. The aim of this project is to further explore the
task of classifying handwritten text and to convert handwritten text into digital format.
Handwritten text is a very general term, so we narrowed the scope of the
project by specifying what handwritten text means for our purposes. In this project, we
took on the challenge of classifying the image of any handwritten word, which may be in
cursive or block writing. This project can be combined with algorithms that
segment the word images in a given line image, which can in turn be combined with
algorithms that segment the line images in a given image of a whole handwritten page. With
these added layers, our project could become a deliverable for an
end user: a fully functional system that converts handwritten documents into digital
format by prompting the user to take a picture
of a page of notes. Although these layers are needed on top of our
model to create a fully functional deliverable, we believe that the most
interesting and challenging part of this problem is classification, which is why we
decided to tackle that instead of the segmentation of lines into words, documents into lines, and so on.
We approach this problem with complete word images because CNNs tend to work better on
raw input pixels than on features or parts of an image [4]. Building on our findings with entire
word images, we sought improvement by extracting characters from each word image and
then classifying each character independently to reconstruct the whole word. In summary, in
both of our techniques, our models take in an image of a word and output the
word it contains.
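The character-based technique ends with a reconstruction step. A minimal sketch of that step is given here, assuming each segmented character arrives as a bounding box with a predicted label. The CharBox type, its fields, and the sample values are hypothetical and are not taken from our implementation.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """One segmented character: horizontal extent in pixels plus a predicted label."""
    x_min: int
    x_max: int
    label: str         # hypothetical output of the character classifier
    confidence: float  # hypothetical classifier confidence, unused in the ordering


def reconstruct_word(boxes: list[CharBox]) -> str:
    """Order the boxes left to right by their horizontal centre and join the labels."""
    ordered = sorted(boxes, key=lambda b: (b.x_min + b.x_max) / 2)
    return "".join(b.label for b in ordered)


# Boxes deliberately listed out of order, as a segmenter might emit them.
boxes = [
    CharBox(40, 58, "t", 0.91),
    CharBox(0, 18, "c", 0.88),
    CharBox(20, 38, "a", 0.95),
]
print(reconstruct_word(boxes))  # cat
```

Sorting by box centre rather than by left edge is a design choice that tolerates slightly overlapping boxes, which are common in cursive writing.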
Clear and Legible Handwriting: Handwriting recognition models are typically designed to work with
clear and legible handwriting. They may struggle with highly stylized or messy handwriting, where
characters are not well-defined or can overlap. Such cases might require specialized techniques or
additional preprocessing steps.
Language and Script: Handwritten text recognition models are designed for specific languages and
scripts. The model needs to be trained on data that matches the language and script of the target
application. Each language and script may have unique characteristics that require specific modeling
approaches.
Preprocessing and Image Quality: Handwritten text recognition often involves preprocessing steps to
enhance image quality, remove noise, or segment individual characters. The effectiveness of these
preprocessing techniques depends on the quality of the input images. Images with low resolution,
poor contrast, or significant distortions may affect the model's performance.
Model Architecture and Hyperparameters: The choice of model architecture and hyperparameters
can significantly impact the performance of the handwritten text recognition system. Different
architectures, such as CNNs, RNNs, or their combinations, may have varying strengths and
weaknesses for different tasks. The selection of hyperparameters, such as learning rate, batch size,
or regularization techniques, requires careful tuning to achieve optimal results.
Computational Resources: Training and deploying complex handwritten text recognition models can
require substantial computational resources, including powerful GPUs and sufficient memory. The
availability of such resources can impact the feasibility and efficiency of the project.
Human Annotation for Ground Truth Labels: Handwritten text recognition models typically require
labeled data where the ground truth text is known. Obtaining these annotations often requires
manual effort, either by experts or crowdsourcing. The quality and consistency of these annotations
can influence the performance of the model.
Evaluation Metrics: The choice of evaluation metrics depends on the specific objectives of the
handwritten text recognition project. Accuracy, precision, recall, and F1 score are commonly used
metrics. However, the selection of appropriate metrics should consider the project requirements,
such as the importance of character-level accuracy, word-level accuracy, or context-based accuracy.
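To make the difference between character-level and word-level accuracy concrete, the sketch below computes a character error rate (total edit distance divided by total reference length) and an exact-match word accuracy. The function names and the sample strings are illustrative assumptions, not results from this project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(predictions: list[str], references: list[str]) -> float:
    """Total edit distance divided by the total number of reference characters."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return edits / sum(len(r) for r in references)


def word_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of words recognized exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


references = ["hello", "world", "notes"]
predictions = ["hello", "wor1d", "note"]  # made-up recognizer output
print(character_error_rate(predictions, references))  # 2 edits over 15 characters
print(word_accuracy(predictions, references))         # 1 exact match out of 3
```

The example shows why the choice of metric matters: two single-character mistakes give a low character error rate but leave only one word in three fully correct.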
3. Overall Description
In an on-line handwriting recognition system, the motion of the tip of the stylus (pen) is
sampled at equal time intervals using a digitizer tablet and passed to a computer which
runs the handwriting recognition algorithm. In most systems, the data signal first undergoes
some filtering process. The signal is then normalized to a standard size, and its slant and
slope are corrected. After normalization, the writing is usually segmented into basic units and
each segment is classified and labelled. Using a search algorithm in the context of a language
model, the most likely path is then returned to the user as the intended string.
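As a rough illustration of the final step, the sketch below picks the most likely word by combining per-segment classifier scores with word priors from a small lexicon. It is a simplification: it scores every lexicon entry exhaustively instead of searching a path lattice, and the names best_word, char_probs, and lexicon, along with all probabilities, are made up for the example.

```python
import math


def best_word(char_probs: list[dict[str, float]], lexicon: dict[str, float]) -> str:
    """Return the lexicon word maximizing log P(word) + sum of log P(char | segment)."""

    def score(word: str) -> float:
        if len(word) != len(char_probs):
            return -math.inf  # word cannot align with the segments one-to-one
        total = math.log(lexicon[word])
        for ch, probs in zip(word, char_probs):
            total += math.log(probs.get(ch, 1e-9))  # small floor for unseen labels
        return total

    return max(lexicon, key=score)


# Hypothetical classifier output for three segments; the middle one is ambiguous.
char_probs = [
    {"c": 0.6, "e": 0.4},
    {"a": 0.5, "o": 0.5},
    {"t": 0.9, "l": 0.1},
]
# Hypothetical language-model priors over a three-word lexicon.
lexicon = {"cat": 0.5, "cot": 0.3, "eat": 0.2}

print(best_word(char_probs, lexicon))  # cat
```

The classifier alone cannot decide between "a" and "o" in the middle segment; the prior from the language model breaks the tie.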
Digitizer
Digitizer technology has always been one of the most important bottlenecks in the
development of on-line handwriting recognizers. This technology started off
in the late 1950's [3]. The first digitizer tablets were bulky and opaque. Their sampling rates
were very low (less than 70 Hz) and they generated nonlinear signals depending on the
position of the stylus on the tablet. As digitizer technology matured, by the late 1980's and
early 1990's, less expensive digitizers were developed which were very small and practical
with reduced nonlinearities. These new digitizers made better handwriting recognition
accuracies possible. Many different strategies have been used for building digitizer tablets.
Some of these methods are inductive, capacitive, piezoresistive, conductive, and
electromagnetic stylus-tablet interactions.
With the advancement of Liquid Crystal Display (LCD) technology, digitizer tablets were
placed under the LCDs such that a transparent look could be given to the writing on
these tablets. With these new digitizers, the trace of the stylus could instantly be displayed
on the LCD, acting as electronic ink. These tablets were much easier to get used to
for the untrained user. There were still some problems which needed a solution, such as
sufficient backlighting, parallax due to the thickness of the LCDs, tablet nonlinearities,
excessive weight, slippery glass surfaces, pen design and ease of handling, etc. Most of these
problems have since received acceptable solutions.
Digitizers are now transparent. They have minimal parallax and reduced nonlinearities. Some
have great color displays. Manufacturers have introduced friction between the tip of the stylus
and the glass over the LCD to simulate a pen-on-paper feel. Most important of all, these
digitizers are now much less expensive and they can produce sampling rates of over 200 Hz.
Most experiments reveal that sampling frequencies over 100 Hz are sufficient for
obtaining good recognition results.
Preprocessing
Most digitizer tablets have built-in low-pass filters in hardware form which take away the
jagged nature of the handwriting signal. These filters in some cases generate more problems
than they solve. For example, the smoothing of the data at the hardware level could
sometimes smooth away natural cusps and corners which are very important in recognizing
certain characters such as v's and s's. It is sometimes more desirable to do minimal filtering
at the hardware level and to leave most of the filtering to the discretion of the recognition
algorithm developer. Once some basic filtering is done, it is usually desirable to magnify the
writing to a standard height such that the recognition scheme becomes size
independent [5]. To perform such normalization, the base-line and mid-line should be
estimated.
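The smoothing trade-off described above can be seen with a very simple filter. The sketch below uses a centred moving average as a stand-in for a low-pass filter (hardware filters in real tablets will differ) and applies it to the y-coordinates of a stroke shaped like a "v". The function name and the sample data are assumptions for illustration.

```python
def moving_average(values: list[float], window: int = 3) -> list[float]:
    """Smooth a 1-D signal with a centred moving average; edges use a shorter window."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# y-coordinates of a pen stroke tracing a "v"; the cusp is the minimum at index 3.
v_stroke = [3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]
smoothed = moving_average(v_stroke, window=3)

print(v_stroke[3])  # depth of the cusp before filtering
print(smoothed[3])  # the cusp is shallower after filtering
```

Even a three-sample window lifts the bottom of the "v" by two thirds of a unit, which is the kind of corner blunting that makes heavy hardware filtering undesirable.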
The area surrounded by the base-line and the mid-line is the only part of any word which is
always non-empty. This makes this area the most reliable portion of the data for usage in
size normalization. Once accurate estimates of the base-line and the mid-line are given, a
magnification factor could be computed from the ratio of the nominal mid-portion size and
that of the input. The entire input data may then be magnified using the obtained
magnification factor.
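The size normalization described above reduces to one ratio and one scaling. A minimal sketch follows, assuming the base-line and mid-line have already been estimated as y-coordinates; the function names, the nominal height, and the sample points are hypothetical.

```python
Point = tuple[float, float]


def magnification_factor(base_line_y: float, mid_line_y: float,
                         nominal_height: float) -> float:
    """Ratio of the nominal mid-portion height to the height measured in the input."""
    input_height = abs(mid_line_y - base_line_y)
    if input_height == 0:
        raise ValueError("base-line and mid-line coincide; cannot normalize size")
    return nominal_height / input_height


def magnify(points: list[Point], factor: float) -> list[Point]:
    """Scale every sampled point by the same factor."""
    return [(x * factor, y * factor) for x, y in points]


# Assumed estimates: base-line at y = 10, mid-line at y = 30, nominal height 40.
factor = magnification_factor(base_line_y=10.0, mid_line_y=30.0, nominal_height=40.0)
stroke = [(0.0, 10.0), (5.0, 30.0), (8.0, 55.0)]  # the last point is an ascender

print(factor)                   # 2.0
print(magnify(stroke, factor))  # [(0.0, 20.0), (10.0, 60.0), (16.0, 110.0)]
```

Ascenders and descenders are scaled by the same factor but do not contribute to it, which matches the reasoning that only the mid-portion is always present.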
Other possible normalizations are slant and slope correction. In slant correction, usually
some mean dominant vertically oriented slope is computed. The slope of the data is then
offset by the difference between the slope of the vertical axis and the computed slope.
This correction is usually done through shearing, since for small deformations shearing is a
good approximation of rotation.
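One way to sketch slant correction is shown below. It assumes the slant is measured as horizontal run per unit of vertical rise (dx/dy), averaged over segments that are more vertical than horizontal, and then removed with a horizontal shear x' = x - slant * y. This particular estimator and the function names are assumptions; other estimators of the dominant slope would work equally well.

```python
Point = tuple[float, float]


def estimate_slant(points: list[Point]) -> float:
    """Mean dx/dy over the vertically oriented segments of a stroke."""
    ratios = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if abs(dy) > abs(dx):  # keep only segments that are mostly vertical
            ratios.append(dx / dy)
    return sum(ratios) / len(ratios) if ratios else 0.0


def correct_slant(points: list[Point], slant: float) -> list[Point]:
    """Shear horizontally so that strokes with the given slant become upright."""
    return [(x - slant * y, y) for x, y in points]


# A downstroke leaning to the right: x grows by 1 for every 4 units of height.
stroke = [(0.0, 0.0), (1.0, 4.0), (2.0, 8.0)]
slant = estimate_slant(stroke)

print(slant)                         # 0.25
print(correct_slant(stroke, slant))  # [(0.0, 0.0), (0.0, 4.0), (0.0, 8.0)]
```

The shear leaves every y-coordinate unchanged, so heights estimated during size normalization remain valid after slant correction.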
Slope correction is usually an iterative process which uses both of the above normalizations
to estimate and re-estimate the slope of the base-line. The data is then slope-corrected
by shearing it along the vertical axis such that the base-line becomes horizontal.
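A single pass of slope correction might look like the sketch below. It assumes a set of points believed to lie on the base-line is available, fits their slope by ordinary least squares, and applies the vertical shear y' = y - slope * x. A full system would repeat the estimate after each correction; the function names and sample points are hypothetical.

```python
Point = tuple[float, float]


def fit_slope(points: list[Point]) -> float:
    """Least-squares slope of y against x for the given base-line points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def correct_slope(points: list[Point], slope: float) -> list[Point]:
    """Shear along the vertical axis so that a line with this slope becomes horizontal."""
    return [(x, y - slope * x) for x, y in points]


# Assumed base-line samples from writing that drifts upward across the page.
base_line_points = [(0.0, 1.0), (10.0, 2.0), (20.0, 3.0)]
slope = fit_slope(base_line_points)
corrected = correct_slope(base_line_points, slope)

print(slope)      # rise of 1 unit per 10 units of width
print(corrected)  # every y-coordinate is now the same
```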
In software development, an interface refers to a set of rules, protocols, or contracts that define
how different software components or systems can interact and communicate with each other. It
defines the methods, operations, parameters, and data types that are expected to be used when
integrating or using a particular software component.
Interfaces provide a level of abstraction, allowing different components to interact without being
tightly coupled to each other's implementation details. They define a standard way for components
to communicate, enabling modularity, extensibility, and interoperability in software systems.
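The decoupling described above can be illustrated with a structural interface. In the sketch below, Recognizer is a hypothetical contract with a single recognize method; any class that provides that method can be passed to transcribe without the caller knowing how recognition is implemented. None of these names come from an existing library.

```python
from typing import Protocol


class Recognizer(Protocol):
    """Contract: turn the bytes of a handwriting image into text."""

    def recognize(self, image_bytes: bytes, language: str) -> str: ...


class FixedRecognizer:
    """Stand-in implementation that always returns the same text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def recognize(self, image_bytes: bytes, language: str) -> str:
        return self.text


class EchoRecognizer:
    """Second stand-in that only reports what it was given."""

    def recognize(self, image_bytes: bytes, language: str) -> str:
        return f"<{len(image_bytes)} bytes, {language}>"


def transcribe(recognizer: Recognizer, image_bytes: bytes, language: str = "en") -> str:
    """Caller code depends only on the contract, not on a concrete class."""
    return recognizer.recognize(image_bytes, language).strip()


print(transcribe(FixedRecognizer("  hello "), b""))  # hello
print(transcribe(EchoRecognizer(), b"abc"))          # <3 bytes, en>
```

Swapping one recognizer for the other requires no change to transcribe, which is the modularity the interface is meant to provide.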
3.1 User Interface
Input Method: The UI should provide an input method that allows users to input their handwritten
text. This can include options such as drawing directly on a touch screen or using a stylus, uploading
scanned images or photos of handwritten text, or even utilizing digital pen and tablet devices.
Image Preview/Display: If users are uploading images of handwritten text, the UI should include a
preview or display area to show the uploaded images. This allows users to verify that the correct
image has been selected before processing for recognition.
Processing and Recognition Feedback: The UI can provide real-time feedback during the processing
and recognition stage. This can include progress indicators or status updates to inform the user
about the recognition progress. Once the recognition is complete, the recognized text can be
displayed for the user to review.
Text Correction and Editing: To enhance the user experience, the UI can offer options for text
correction and editing. Users may want to make corrections or modifications to the recognized text.
Providing an interface for easy editing, such as selecting and modifying specific characters or words,
can be beneficial.
Output Display: The UI should display the recognized text in a clear and readable format. Depending
on the context and purpose of the application, the recognized text can be displayed in a separate
text box or integrated into a larger document or application.
Error Handling: The UI should handle errors or exceptional cases gracefully. For example, if the
recognition process fails or the input image is of poor quality, appropriate error messages or
suggestions can be displayed to guide the user and help them resolve the issue.
User Assistance and Help: It's beneficial to include user assistance features such as tooltips,
contextual help, or a dedicated help section. These features can guide users on how to use the
interface effectively, provide instructions on capturing high-quality images, or offer tips for
improving recognition accuracy.
Responsive Design: The UI should be designed to be responsive and compatible with various devices
and screen sizes. This ensures a consistent and user-friendly experience across different platforms,
such as desktop computers, tablets, or smartphones.
Visual Design and Aesthetics: Consider the visual design and aesthetics of the UI to create an
appealing and intuitive user interface. Use appropriate colors, fonts, and layout to enhance
readability and usability. Consistency in design elements and adherence to established UI design
principles can contribute to a positive user experience.
User Feedback and Iteration: It's essential to gather user feedback on the UI and iterate based on
their input. Conduct user testing and collect feedback to identify areas for improvement, usability
issues, or additional features that can enhance the user interface.
3.2 Software Interface
Application Programming Interface (API): An API provides a set of functions, methods, or protocols
that allow other software systems or developers to interact with the handwriting recognition
system. The API defines the input and output formats, the available functions or operations, and any
required authentication or authorization mechanisms.
Data Input and Output Formats: The software interface should specify the expected input formats
for the handwritten text, such as image formats (JPEG, PNG, etc.), document formats (PDF, DOC,
etc.), or serialized data formats (JSON, XML, etc.). It should also define the output format for the
recognized text, such as plain text, structured data, or specific document formats.
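As one possible way to enforce the input and output formats, the sketch below identifies an upload from its leading bytes (the standard PNG, JPEG, and PDF signatures) and serializes the recognized text as JSON. The function names and the fields of the JSON object are assumptions; the specification does not fix them.

```python
import json
from typing import Optional

# Leading bytes that identify each supported container format.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
}


def detect_format(data: bytes) -> Optional[str]:
    """Return the format name if the data starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def to_json(text: str, input_format: str) -> str:
    """Serialize the recognized text together with the detected input format."""
    return json.dumps({"text": text, "input_format": input_format}, sort_keys=True)


upload = b"\x89PNG\r\n\x1a\n" + b"...image data..."  # placeholder payload
fmt = detect_format(upload)
print(fmt)                              # png
print(to_json("meeting at noon", fmt))  # JSON object with both fields
```

Checking the content rather than the file extension guards against mislabeled uploads.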
Method Invocation: If the handwriting recognition system is exposed as an API, the software
interface should define the methods or operations that can be invoked to initiate the recognition
process. This includes specifying the parameters required for each method, such as the input image,
recognition options, or language settings.
Error Handling and Status Codes: The software interface should define the error handling
mechanism, including the types of errors that can occur during the recognition process and the
corresponding error codes or messages. It should also specify the status codes or responses that
indicate the success or failure of the recognition process.
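A small sketch of such an error-handling contract is given below. The status values, the RecognitionResult type, and the safe_recognize wrapper are hypothetical; they only show how failures can be reported as explicit statuses rather than as uncaught exceptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    OK = 0
    INVALID_INPUT = 1
    RECOGNITION_FAILED = 2


@dataclass
class RecognitionResult:
    status: Status
    text: Optional[str] = None
    message: str = ""


def safe_recognize(recognize: Callable[[bytes], str], image: bytes) -> RecognitionResult:
    """Run a recognizer and translate every outcome into a status plus a message."""
    if not image:
        return RecognitionResult(Status.INVALID_INPUT, message="empty image")
    try:
        return RecognitionResult(Status.OK, text=recognize(image))
    except Exception as exc:  # any recognizer failure becomes a reported status
        return RecognitionResult(Status.RECOGNITION_FAILED, message=str(exc))


def failing_recognizer(image: bytes) -> str:
    """Stand-in recognizer that always fails, used to exercise the error path."""
    raise RuntimeError("model could not decode the image")


print(safe_recognize(lambda image: "hi", b"x"))
print(safe_recognize(failing_recognizer, b"x"))
print(safe_recognize(failing_recognizer, b""))
```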
Authentication and Security: If the handwriting recognition system requires authentication or access
control, the software interface should detail the authentication methods, security protocols, or
access tokens required to access and utilize the system. This ensures that only authorized users or
systems can interact with the system.
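One simple access-token scheme, shown only as an assumption about how such a requirement could be met, derives a token from the client identifier with HMAC-SHA256 and a server-side secret. The function names, client identifiers, and the secret are placeholders; a production system would also need token expiry and secure secret storage.

```python
import hashlib
import hmac


def issue_token(client_id: str, secret: bytes) -> str:
    """Derive a token for a client by signing its identifier with the server secret."""
    return hmac.new(secret, client_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_authorized(client_id: str, token: str, secret: bytes) -> bool:
    """Recompute the expected token and compare it in constant time."""
    return hmac.compare_digest(issue_token(client_id, secret), token)


secret = b"example-secret-not-for-production"  # placeholder value
token = issue_token("notes-app", secret)

print(is_authorized("notes-app", token, secret))    # True
print(is_authorized("other-client", token, secret))  # False
```

Using hmac.compare_digest instead of == avoids leaking information about the token through comparison timing.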
Libraries or SDKs: In addition to APIs, the software interface may provide libraries or software
development kits (SDKs) that offer pre-built functions, classes, or modules to simplify the integration
and usage of the handwriting recognition system within other software applications or frameworks.
These libraries or SDKs may include code examples, documentation, and additional tools to aid
developers.
Integration Requirements: The software interface should outline any specific integration
requirements, such as dependencies on external libraries or software components, supported
programming languages, or operating systems. It should provide clear instructions and guidelines for
integrating the handwriting recognition system into different software environments.
Versioning and Compatibility: The software interface may include versioning information to ensure
backward compatibility and smooth transitions between different versions of the handwriting
recognition system. This allows developers to upgrade or adapt their integration without disrupting
existing functionality.
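If the interface adopts a major.minor.patch numbering scheme (an assumption; this specification does not prescribe one), a client can check compatibility as sketched below: the major version must match, and the system must be at least as new as the version the client was built against.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_backward_compatible(client_built_for: str, system_version: str) -> bool:
    """Same major version, and the system is not older than the client expects."""
    client = parse_version(client_built_for)
    system = parse_version(system_version)
    return system[0] == client[0] and system[1:] >= client[1:]


print(is_backward_compatible("1.2.0", "1.4.1"))  # True: newer minor, same major
print(is_backward_compatible("1.2.0", "2.0.0"))  # False: major version changed
print(is_backward_compatible("1.4.0", "1.2.9"))  # False: system is older
```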
Performance and Scalability: The software interface should provide guidelines or recommendations
on performance considerations and scalability aspects. It may include information on processing
time, resource utilization, concurrent request handling, or strategies for scaling the system to handle
higher loads.
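Concurrent request handling can be sketched with a thread pool from the Python standard library. Here recognize_stub is a placeholder for the real recognizer, and the worker count is an arbitrary assumption that would need tuning against the deployment hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_stub(image_bytes: bytes) -> str:
    """Placeholder for a real recognizer; only reports the size of its input."""
    return f"{len(image_bytes)} bytes"


def recognize_batch(images: list[bytes], max_workers: int = 4) -> list[str]:
    """Process several requests concurrently; results keep the order of the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_stub, images))


print(recognize_batch([b"a", b"bb", b"ccc"]))  # ['1 bytes', '2 bytes', '3 bytes']
```

A thread pool suits workloads that wait on I/O or on a model server; a CPU-bound recognizer written in pure Python would be better served by separate processes.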
Documentation and Support: Comprehensive documentation, including API reference guides, usage
examples, and troubleshooting information, should accompany the software interface. This enables
developers to understand and effectively utilize the handwriting recognition system. Additionally,
providing support channels, such as forums, email support, or developer communities, can assist
users in resolving issues or seeking guidance.
3.3 Constraints
Data Quality and Variability: Handwritten text can exhibit significant variability in terms of
handwriting styles, quality, and legibility. The system may face challenges in accurately recognizing
text that is poorly written, has overlapping characters, or contains unusual variations. Handling such
variability and ensuring robustness in recognition is a constraint.
Training Data Availability: Developing accurate recognition models requires a large and diverse
dataset for training. Acquiring and preparing such training data, particularly with ground truth
annotations, can be a constraint. Limited availability of high-quality labeled data may impact the
system's accuracy and generalization capabilities.
Processing Time and Response Speed: Real-time or near real-time recognition is often desirable in
applications such as digital note-taking or online form filling. The system must operate within
acceptable time constraints to provide prompt responses, especially when processing large
documents or handling concurrent requests.
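Whether the system stays within its time constraint can be checked by measuring each call against a budget, as in the sketch below. The helper names and the 0.5 second budget are illustrative assumptions; the acceptable latency has to be set per application.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def within_budget(elapsed_seconds: float, budget_seconds: float) -> bool:
    """True when the measured time does not exceed the allowed response time."""
    return elapsed_seconds <= budget_seconds


# len stands in for a recognizer call in this example.
result, elapsed = timed(len, b"scanned page bytes")
print(result)
print(within_budget(elapsed, budget_seconds=0.5))
```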
Language and Script Support: Handwriting recognition systems designed for specific languages or
scripts may face constraints in recognizing handwriting from different languages or scripts. Each
language or script may have unique characteristics that require specific modeling techniques or
additional data.
Accuracy and Error Rates: Handwriting recognition systems strive for high accuracy, but there will
always be some error rate due to the inherent challenges in recognizing handwriting. The system's
performance must be evaluated and balanced against acceptable error rates for specific applications
or use cases.
Deployment Constraints: The deployment environment can impose constraints on the system, such
as limited network connectivity, security requirements, or compatibility with existing infrastructure.
Adapting the system to work within these constraints is crucial for successful deployment.
User Interaction Constraints: The usability and user experience of the system can be influenced by
constraints related to the user interface, input methods, or interaction paradigms. Designing the
system to be intuitive, accommodating different user preferences, and providing suitable input
options is essential.
Privacy and Security: Handwriting recognition systems may handle sensitive information, such as
personal or financial data. Ensuring privacy and security in data handling, storage, and transmission
is a critical constraint that must be addressed to protect user confidentiality.