
Assignment: Fine-Tuning an OCR Model for Handwriting Recognition

Role

AI Engineer

Objective

As an AI Engineer, your task is to fine-tune a state-of-the-art Optical Character Recognition (OCR) model to achieve high accuracy in recognizing handwritten text. The focus is on leveraging the latest datasets and models to improve performance on diverse handwriting styles, including noisy or irregular samples. This assignment simulates a real-world scenario in which your solution will be part of a document digitization pipeline.

Task Description

You are required to:

1. Select and fine-tune a modern OCR model for handwriting recognition (a short baseline-inference sketch follows this list).
2. Use the latest publicly available datasets to train and evaluate your model.
3. Optimize the model for accuracy and efficiency, considering real-world challenges such as varied handwriting styles, noise, and irregular layouts.
4. Provide a report detailing your methodology, results, and a brief justification of your choices.
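
Before fine-tuning, it is worth running the pre-trained checkpoint on a few samples to establish a baseline. The sketch below is a minimal example, not a required implementation: it assumes the TrOCR checkpoint recommended under Specific Requirements and the Hugging Face transformers and evaluate libraries (the CER metric additionally requires jiwer); the image path and reference transcription are placeholders.

```python
# Baseline sanity check: run the pre-trained TrOCR checkpoint on one line image
# before any fine-tuning. The image path and reference string are placeholders.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import evaluate

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

image = Image.open("sample_line.png").convert("RGB")  # placeholder: one handwritten line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values, max_new_tokens=64)
prediction = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

reference = "the ground-truth transcription of the line"  # placeholder reference text
cer = evaluate.load("cer").compute(predictions=[prediction], references=[reference])
print(f"Baseline prediction: {prediction!r}  CER vs. reference: {cer:.3f}")
```

A baseline CER from the un-tuned model also gives you a concrete number to improve upon in your report.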

Specific Requirements

●​ Model: Choose a transformer-based OCR model as your starting point. Recommended options include:
○​ TrOCR (Transformer-based OCR): Available via Hugging Face (microsoft/trocr-large-handwritten), known for its strong performance on handwritten text recognition. It combines a Vision Transformer (ViT) encoder with a text Transformer decoder.
○​ DocTR (Document Text Recognition): An open-source option from Mindee, combining text detection (DBNet) and recognition (CRNN and Transformer-based recognizers), supporting both printed and handwritten text.
●​ Dataset: Use the following recent and diverse datasets for fine-tuning and evaluation:
○​ IAM Handwriting Database (Updated 2023 Version): Contains 13,353 handwritten English text lines from 657 writers. Access it via the official IAM website or the Hugging Face datasets hub. Focus on the line-level annotations for this task.
○​ Imgur5K (2021): A diverse dataset with ~135K handwritten English words across 5K images, offering variability in styles and real-world scenarios. Available via Papers with Code.
○​ Synthetic Data (Optional): Generate additional synthetic handwritten data using tools like TextRecognitionDataGenerator (GitHub: Belval/TextRecognitionDataGenerator) to augment your training set with custom styles or edge cases.
●​ Evaluation Metrics:
○​ Primary: Character Error Rate (CER) – the edit distance between predicted and ground-truth text at the character level, normalized by the length of the ground truth.
○​ Secondary: Word Error Rate (WER) – the same measure computed at the word level.
○​ Target: Achieve a CER ≤ 7% and WER ≤ 15% on a held-out test set from the IAM dataset.
●​ Tools and Frameworks:
○​ Use PyTorch or TensorFlow for model implementation.
○​ Leverage Hugging Face Transformers for TrOCR, or Mindee’s DocTR library, for ease of fine-tuning.
○​ Preprocessing: Apply OpenCV or PIL for image normalization (e.g., resizing to 384x384 for TrOCR, grayscale conversion, noise reduction).
●​ Fine-Tuning Process (a minimal code sketch follows this requirements list):
○​ Preprocess the dataset (e.g., normalize images, tokenize text using the model’s tokenizer).
○​ Fine-tune the pre-trained model on the combined IAM and Imgur5K datasets for at least 10 epochs, adjusting hyperparameters such as learning rate (suggested: 5e-5) and batch size (suggested: 8, GPU-dependent).
○​ Use a validation split (10% of the data) to monitor for overfitting and apply early stopping if needed.
●​ Hardware: You will use free GPU resources available on:
○​ Kaggle: Provides a Tesla P100 (16GB VRAM) or dual NVIDIA T4s (2x 16GB VRAM) with ~30 hours of GPU time per week and a 12-hour session limit.
■​ Optimization for Fine-Tuning: Use a batch size of 4 with the P100 or 8 with dual T4s (enable multi-GPU via PyTorch’s DataParallel). Enable mixed precision training (torch.cuda.amp) to reduce memory usage and speed up training.
■​ Data Handling: Import IAM and Imgur5K directly via Kaggle’s dataset hub, or upload synthetic data as a custom dataset.
○​ Google Colab: Offers a Tesla T4 (16GB VRAM) or occasionally a K80 (12GB VRAM) with a ~12-hour session limit (subject to availability).
■​ Optimization for Fine-Tuning: Set the batch size to 4 for the T4 (or 2 for the K80) and use mixed precision training. If memory issues persist, apply gradient accumulation (e.g., accumulate gradients over 2 steps to simulate a batch size of 8).
■​ Data Handling: Mount Google Drive to load datasets or upload them manually to Colab’s runtime.
○​ Both platforms suffice for this task with proper optimization. The 16GB VRAM limit (common across free tiers) is slightly below a premium 24GB GPU, but adjustments such as smaller batch sizes or mixed precision ensure feasibility.
●​ Deliverables:
○​ Code: A Jupyter notebook or Python script with clear comments, implementing the full pipeline (data loading, preprocessing, fine-tuning, evaluation).
○​ Model: Save the fine-tuned model weights in a standard format (e.g., .pth for PyTorch, or upload them to the Hugging Face model hub).
○​ Report: A 1-2 page PDF summarizing:
■​ Dataset and model choices with justification.
■​ Preprocessing steps and fine-tuning strategy.
■​ Final CER and WER scores on the test set.
■​ Challenges faced and potential improvements.
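
To make the fine-tuning and evaluation requirements above concrete (referenced in the Fine-Tuning Process bullet), the following is a minimal sketch rather than a prescribed solution. It assumes TrOCR with the Hugging Face Seq2SeqTrainer and the evaluate library (with jiwer installed); the train_samples and eval_samples lists, file paths, and output directory are hypothetical placeholders to be replaced with data parsed from the IAM and Imgur5K annotations. For reference, CER = (S + D + I) / N, i.e., the number of character substitutions, deletions, and insertions divided by the number of characters in the ground truth; WER is the same ratio computed over words.

```python
# Minimal TrOCR fine-tuning sketch. Assumptions: Hugging Face transformers and
# evaluate are installed, and data is prepared as (image_path, transcription)
# pairs -- the placeholder lists below must be replaced with real annotations.
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers import (
    TrOCRProcessor,
    VisionEncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)
import evaluate

MODEL_NAME = "microsoft/trocr-large-handwritten"
processor = TrOCRProcessor.from_pretrained(MODEL_NAME)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)

# TrOCR needs these decoder settings before seq2seq fine-tuning.
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

# Hypothetical placeholders -- replace with pairs parsed from IAM / Imgur5K.
train_samples = [("data/train/line_0001.png", "example transcription")]
eval_samples = [("data/val/line_0001.png", "example transcription")]

class HandwritingDataset(Dataset):
    """(image_path, text) pairs; the processor handles the 384x384 resize."""
    def __init__(self, samples, processor, max_target_length=128):
        self.samples = samples
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, text = self.samples[idx]
        image = Image.open(image_path).convert("RGB")
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values[0]
        labels = self.processor.tokenizer(
            text, padding="max_length", truncation=True, max_length=self.max_target_length
        ).input_ids
        # Pad tokens become -100 so the loss ignores them.
        labels = [t if t != self.processor.tokenizer.pad_token_id else -100 for t in labels]
        return {"pixel_values": pixel_values, "labels": torch.tensor(labels)}

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    # Decode generated ids and references, then score CER / WER.
    label_ids = pred.label_ids.copy()
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    predictions = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    references = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {
        "cer": cer_metric.compute(predictions=predictions, references=references),
        "wer": wer_metric.compute(predictions=predictions, references=references),
    }

training_args = Seq2SeqTrainingArguments(
    output_dir="./trocr-handwriting",
    num_train_epochs=10,
    learning_rate=5e-5,
    per_device_train_batch_size=4,   # 4 on a single T4/P100; 8 on dual T4s
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,   # simulates an effective batch size of 8
    fp16=True,                       # mixed precision for the free-tier GPUs
    predict_with_generate=True,      # generate text during evaluation for CER/WER
    generation_max_length=64,
    save_strategy="epoch",
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=HandwritingDataset(train_samples, processor),
    eval_dataset=HandwritingDataset(eval_samples, processor),
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports eval_cer / eval_wer on the held-out split
model.save_pretrained("./trocr-handwriting/final")
processor.save_pretrained("./trocr-handwriting/final")
```

With a small per-device batch size, gradient accumulation, and fp16 mixed precision, this configuration is intended to fit within the 16GB VRAM limits described above; batch size and accumulation steps can be traded off per GPU.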

Time Expectation

●​ Estimated completion time: 8-10 hours.
●​ This includes data preparation (2 hours), model setup and fine-tuning (4-5 hours), evaluation (1 hour), and report writing (1-2 hours).

Evaluation Criteria

●​ Technical Accuracy: Correct implementation of the fine-tuning pipeline and achievement of the target metrics (CER ≤ 7%, WER ≤ 15%).
●​ Code Quality: Clarity, modularity, and documentation of the codebase.
●​ Innovation: Creative approaches to preprocessing, data augmentation, or hyperparameter tuning.
●​ Report: Conciseness, clarity, and depth of reasoning behind choices.

Context

This assignment mimics a real-world task where an AI Engineer must adapt a pre-trained model
to a specific use case (handwriting recognition for digitizing historical documents). The chosen
datasets and models reflect the latest advancements as of March 2025, ensuring the task is
both challenging and relevant.
