Finalreport Dip
FINAL REPORT
Group: 04
HCM, 5/2025
Lecturer's Comments
May 2025
Graded by Lecturer
1.2 Objectives
2. THEORETICAL BACKGROUND
2.1 Overview
Optical Character Recognition (OCR) converts images containing text into digital text and
is widely used in document digitization and handwritten text processing. Pattern recognition
supports character classification, which is crucial for handling diverse handwriting styles.
This project combines OCR and pattern recognition using two deep learning architectures,
a Vision Transformer (ViT) for global context and VGG16 for local feature extraction, to
address the challenges of Vietnamese handwriting recognition.
Both models are fine-tuned on a custom Vietnamese handwriting dataset, and PhoBERT
refines the recognized text to improve accuracy.
Applications of OCR in This Context: The OCR system developed for Vietnamese
handwriting recognition targets applications in specialized domains. For instance, it
can digitize doctors' handwritten prescriptions or medical notes in Vietnamese,
enabling hospitals to convert hard-to-read handwriting into digital records for better
patient data management and integration with electronic health systems. Similarly, the
system can process handwritten administrative forms in government offices, automating
data entry and reducing errors. In educational settings, it facilitates the transcription of
handwritten student notes or exam scripts, supporting digital grading and archiving. These
applications rely on the VietOCR models' ability to handle complex Vietnamese
diacritical marks and diverse handwriting styles.
3. DESIGN AND IMPLEMENTATION
3.1 Tools and Technologies
Libraries:
User Interface: HTML, CSS (Bootstrap), JavaScript for a web interface integrated with
FastAPI.
The dataset comes from the Cinnamon AI Marathon (2018) and contains about 2,000
samples for the Vietnamese handwriting recognition task. We use 80% for training and
20% for validation.
image:
label:
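The 80/20 split described above can be sketched as follows. The training code in Section 3.5.1 calls a split_dataset helper whose body is not shown in this report, so this is only a minimal sketch under stated assumptions: the labels file is assumed to be a JSON object mapping image file names to label text, and the output file names train.json and valid.json are illustrative.

```python
import json
import os
import random


def split_dataset(labels_file, output_dir, train_ratio=0.8, random_seed=42):
    """Shuffle the labelled samples and write train/validation JSON files.

    Assumption: labels_file contains a JSON object {image_name: label_text}.
    Returns the paths of the two files that were written.
    """
    with open(labels_file, encoding="utf-8") as f:
        labels = json.load(f)

    # Sort first so the shuffle is reproducible for a given seed.
    names = sorted(labels)
    random.Random(random_seed).shuffle(names)

    n_train = int(len(names) * train_ratio)
    splits = {"train": names[:n_train], "valid": names[n_train:]}

    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for split_name, split_names in splits.items():
        path = os.path.join(output_dir, f"{split_name}.json")
        with open(path, "w", encoding="utf-8") as f:
            subset = {name: labels[name] for name in split_names}
            json.dump(subset, f, ensure_ascii=False, indent=2)
        paths.append(path)
    return tuple(paths)
```

Fixing the seed keeps the validation set identical across runs, which makes the ViT and VGG16 results comparable.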
3.2.2 Preprocessing:
Image Preprocessing:
Data Augmentation: Although mentioned in the original report, the code does not
explicitly implement augmentation. Future work could incorporate libraries like imgaug.
The system employs two VietOCR architectures, both paired with a Language Transformer
decoder, trained on a custom Vietnamese handwriting dataset:
Model Architectures:
The application is deployed as a RESTful API using FastAPI, with the following features:
● Upload Interface: Users can upload JPEG or PNG images via a web interface built
with HTML, CSS (Bootstrap), and JavaScript.
● Image Display: Shows the original image and processed image with detected text
regions highlighted in red bounding boxes.
● Text Detection and Recognition:
○ EasyOCR: Detects text regions with configurable parameters (e.g.,
min_size=20, text_threshold=0.4).
○ VietOCR: Recognizes text in detected regions using either the
ViT-Transformer or VGG16-Transformer model, selected based on
configuration or user input.
○ PhoBERT Refinement: Optionally refines recognized text to improve
accuracy.
● Options: Users can toggle scene text detection, text refinement, and paragraph detection.
● Results: Returns JSON output with recognized text, bounding box coordinates,
cropped text region images (in base64), and the count of detected regions.
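To make the shape of the JSON result more concrete, the sketch below assembles a response from a list of recognized regions. It is illustrative only: the field names (text, box, crop_base64, count, regions) and the input tuple format are assumptions, not the actual schema returned by the application.

```python
import base64
import json


def polygon_to_box(polygon):
    """Collapse a 4-point polygon [[x, y], ...] into (x_min, y_min, x_max, y_max)."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))


def build_response(regions):
    """Build a JSON-serializable result.

    regions: list of (text, polygon, image_bytes) tuples (hypothetical format).
    """
    items = []
    for text, polygon, image_bytes in regions:
        items.append({
            "text": text,
            "box": list(polygon_to_box(polygon)),
            # Binary crops are base64-encoded so they can travel inside JSON.
            "crop_base64": base64.b64encode(image_bytes).decode("ascii"),
        })
    return {"count": len(items), "regions": items}


# Placeholder bytes stand in for a real cropped PNG image.
example = build_response([
    ("xin chao", [[10, 12], [90, 10], [92, 40], [11, 42]], b"\x89PNG-placeholder"),
])
print(json.dumps(example, indent=2))
```

Encoding the crops in base64 lets the web page display each region directly from the response without a second request.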
3.5 Implementation:
3.5.1 Code
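from vietocr.tool.config import Cfg
from vietocr.model.trainer import Trainer

# Train the VGG16-Transformer model from the custom configuration file
config = Cfg.load_config_from_file('/kaggle/working/custom_vgg_transformer.yml')
trainer = Trainer(config)
trainer.train()
# Evaluate
trainer.visualize_prediction()
trainer.precision()  # full-sequence and per-character accuracy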
# ViT-Transformer configuration (create_vit_config is a project helper)
config = create_vit_config()
config.update({
    'dataset': {
        'data_root': '/kaggle/input/dataset-ocr/dataset/data',
        'name': 'ocr_dataset',
        'image_height': 224,
        'image_min_width': 224,
        'image_max_width': 224
    },
    'aug': {
        'image_aug': False,
        'masked_language_model': False
    },
    'optimizer': {
        'max_lr': 1e-4,
        'pct_start': 0.1,
        'anneal_strategy': 'cos'
    },
    'dataloader': {
        'num_workers': 1,
        'pin_memory': True
    },
    'predictor': {
        'beamsearch': True
    },
    'pretrain': '/kaggle/input/vietocr/pytorch/default/1/vgg_transformer.pth',
    'quiet': False
})
# Split the labels into training (80%) and validation (20%) sets
original_labels_file = '/kaggle/input/dataset-ocr/dataset/labels.json'
output_dir = './dataset_splits'
train_json, valid_json = split_dataset(
    original_labels_file, output_dir, train_ratio=0.8, random_seed=42)
config['dataset']['train_annotation'] = train_json
config['dataset']['valid_annotation'] = valid_json
# Create vocabulary
# The quote and backslash are escaped so the Python literal keeps both characters.
vocab = (
    'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬb'
    'BcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉ'
    'ỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộ'
    'ỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯ'
    'ừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ01234'
    '56789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '
)
vocab = Vocab(vocab)
config['transformer']['vocab_size'] = len(vocab)
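from torchvision import transforms

# Resize to the 224x224 ViT input size and apply ImageNet normalization
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])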
# Create datasets
train_dataset = CustomDataset(
    data_dir=config['dataset']['data_root'],
    labels_file=config['dataset']['train_annotation'],
    vocab=vocab,
    transform=transform
)
valid_dataset = CustomDataset(
    data_dir=config['dataset']['data_root'],
    labels_file=config['dataset']['valid_annotation'],
    vocab=vocab,
    transform=transform
)
# Create model
model = VietOCR(
    vocab_size=len(vocab),
    backbone=None,
    vit_args=config,
    transformer_args=config['transformer'],
    seq_modeling='transformer'
)
# Create trainer
trainer = Trainer(
    config=config,
    model=model,
    vocab=vocab,
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    pretrained=True
)
# Train model
trainer.train()
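def refine_text(text):  # function name assumed; the report shows only the body
    try:
        if not text.strip():
            return text
        # First pass: let PhoBERT propose whole-sequence candidates.
        # (The construction of masked_text is not shown in the report.)
        predictions = phobert_pipeline(masked_text, top_k=5)
        for pred in predictions:
            candidate = pred["sequence"].replace("<s>", "").replace("</s>", "").strip()
            if is_valid_vietnamese(candidate):
                return candidate
        # Second pass: keep valid words and try to replace the others one by one.
        words = text.split()
        refined_words = []
        for word in words:
            if is_valid_vietnamese(word):
                refined_words.append(word)
                continue
            # (The per-word fill-mask call that refreshes predictions is not
            # shown in the report.)
            for pred in predictions:
                candidate = pred["token_str"]
                if is_valid_vietnamese(candidate):
                    refined_words.append(candidate)
                    break
            else:
                # No valid candidate was found: keep the original word.
                refined_words.append(word)
        return " ".join(refined_words)  # assumed; the final return is not shown
    except Exception:
        # On any failure, fall back to the unrefined text.
        return text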
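def detect_text_regions(image, detect_paragraph=False):  # function name assumed
    # reader is an easyocr.Reader instance created elsewhere; np is numpy.
    try:
        results = reader.readtext(
            np.array(image),
            decoder='wordbeamsearch',
            min_size=20,
            text_threshold=0.4,
            link_threshold=0.4,
            width_ths=0.7,
            paragraph=detect_paragraph
        )
        boxes = []
        for detection in results:
            if len(detection) == 3:
                bbox, _, _ = detection  # (box, text, confidence)
            else:
                bbox, _ = detection  # paragraph mode returns (box, text)
            # Convert the 4-point polygon into an axis-aligned rectangle.
            x_coords = [point[0] for point in bbox]
            y_coords = [point[1] for point in bbox]
            boxes.append((
                int(min(x_coords)), int(min(y_coords)),
                int(max(x_coords)), int(max(y_coords))
            ))
        return boxes
    except Exception:
        # Detection failed: report no regions.
        return []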
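from PIL import ImageEnhance

def enhance_image(img):  # function name assumed; img is a PIL image
    # Double the contrast, then sharpen by a factor of 1.5.
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)
    enhancer = ImageEnhance.Sharpness(img)
    img = enhancer.enhance(1.5)
    return img.convert('RGB')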
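The PhoBERT refinement code in this section relies on an is_valid_vietnamese check that is not listed in the report. The block below is one plausible character-level stand-in, not the project's actual implementation: it assumes a token is acceptable when, after tone marks are removed, every character belongs to the 29-letter Vietnamese alphabet. It does not verify syllable structure or that a syllable carries at most one tone.

```python
import unicodedata

# The 29 letters of the Vietnamese alphabet (non-ASCII letters written as escapes):
# a, a-breve, a-circumflex, b, c, d, d-stroke, e, e-circumflex, g, h, i, k, l, m, n,
# o, o-circumflex, o-horn, p, q, r, s, t, u, u-horn, v, x, y.
BASE_LETTERS = set("a\u0103\u00e2bcd\u0111e\u00eaghiklmno\u00f4\u01a1pqrstu\u01b0vxy")

# Combining tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def _is_vietnamese_word(word):
    """Check one whitespace-free token against the alphabet, ignoring tone marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    toneless = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # Recompose so that e.g. 'a' + breve becomes the single letter a-breve again.
    recomposed = unicodedata.normalize("NFC", toneless)
    return all(ch in BASE_LETTERS for ch in recomposed)


def is_valid_vietnamese(text):
    """Return True if text is non-empty and every token passes the letter check."""
    tokens = text.split()
    if not tokens:
        return False
    return all(_is_vietnamese_word(token) for token in tokens)
```

Under this definition, tokens containing digits, punctuation, or the letters f, j, w, and z are rejected, so a production check would likely need a dictionary or syllable rules on top of it.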
3.5.2 UI
● Introduction page:
● Main page:
The system was tested on handwritten Vietnamese text images, using both ViT and VGG16
architectures:
● Comment:
Based on these metrics (precision, recall, F1-score, confusion matrix): with only a limited
dataset of handwritten Vietnamese, the VietOCR model predicts plain Latin text
reasonably well, but its performance on Vietnamese words remains limited and low.
In addition, the model uses VGG16 for feature extraction, and VGG16 focuses on
local rather than global features, so recognition of long sentences is still
limited.
The system combines EasyOCR, VietOCR (ViT and VGG16 backbones), and PhoBERT for
Vietnamese handwriting recognition. VGG16 significantly outperformed ViT (full-sequence
accuracy 0.3608 vs. 0.0000, per-character accuracy 0.8627 vs. 0.0754), which indicates
that the convolutional backbone was far more effective for this task under the present
dataset and training setup. The FastAPI web interface provided a user-friendly experience.
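For readers unfamiliar with the two accuracy figures quoted above, the sketch below shows one common way to compute them. It is a simplified, position-wise definition given as an assumption; the exact computation inside the VietOCR trainer may differ in details such as how length mismatches are handled.

```python
def full_sequence_accuracy(targets, predictions):
    """Fraction of samples whose predicted string equals the target exactly."""
    if not targets:
        return 0.0
    matches = sum(1 for t, p in zip(targets, predictions) if t == p)
    return matches / len(targets)


def per_character_accuracy(targets, predictions):
    """Mean over samples of the fraction of target characters predicted
    correctly at the same position."""
    scores = []
    for target, prediction in zip(targets, predictions):
        if not target:
            # An empty target counts as correct only if the prediction is empty too.
            scores.append(1.0 if not prediction else 0.0)
            continue
        correct = sum(1 for t, p in zip(target, prediction) if t == p)
        scores.append(correct / len(target))
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative strings only, not taken from the dataset.
targets = ["hello", "abc"]
predictions = ["hellp", "abc"]
print(full_sequence_accuracy(targets, predictions))
print(per_character_accuracy(targets, predictions))
```

A single wrong character makes a whole line count as incorrect for full-sequence accuracy, which is why that figure can be far lower than the per-character one.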
Future Improvements:
6. REFERENCES
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... &
Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition
at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929
Gonzalez, R. C., & Woods, R. E. (2018). Digital image processing (4th ed.). Pearson.
Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for
Vietnamese. Findings of the Association for Computational Linguistics: EMNLP 2020,
1037–1042. https://doi.org/10.18653/v1/2020.findings-emnlp.92
Nguyen, T. (2020). VietOCR: A simple and effective library for Vietnamese optical
character recognition. GitHub Repository. https://github.com/pbcquoc/vietocr
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
https://doi.org/10.48550/arXiv.1409.1556
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &
Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information
Processing Systems, 30, 5998–6008.
https://papers.nips.cc/paper/7181-attention-is-all-you-need
Jaided AI. (2020). EasyOCR: Ready-to-use OCR with 80+ supported languages. GitHub
Repository. https://github.com/JaidedAI/EasyOCR