Finalreport Dip

The final report from HCMC University of Technology and Education presents a project on Vietnamese handwriting recognition using digital image processing techniques. The project aims to develop a user-friendly system that accurately recognizes handwritten Vietnamese text by leveraging advanced architectures like Vision Transformer and VGG16, along with a RESTful API for user interaction. The report outlines the problem, objectives, theoretical background, design, implementation, and results, highlighting the challenges and solutions in achieving effective OCR for Vietnamese handwriting.

Uploaded by

phantrieuhuy2004

HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

FACULTY OF INFORMATION TECHNOLOGY

FINAL REPORT

VIETNAMESE HANDWRITING RECOGNITION

COURSE NAME: Digital Image Processing (DIPR430685E)

SEMESTER 2 – YEAR 2024-2025

Group: 04

Lecturer name: Assoc. Prof. Hoàng Văn Dũng

Hồ Chí Minh, 5/2025



GROUP LIST AND SELF-ASSESSMENT FORM

Second Semester of the 2024–2025 Academic Year

1. Course: Digital Image Processing (DIPR430685E)

2. Lecturer: Hoàng Văn Dũng

3. Project Title: Vietnamese Handwriting Recognition

4. Group List and Self-Assessment Table:

No.   Student Name      Student ID   Participation Rate (%)
1     Phạm Nam Hào      22110023     100%
2     Phan Triệu Huy    22110038     100%

- Total Participation Rate = 100%

Lecturer's Comments

May 2025

Graded by Lecturer

Hoàng Văn Dũng


TABLE OF CONTENTS
1. PROJECT DESCRIPTION
1.1 Problem
1.2 Objectives
2. THEORETICAL BACKGROUND
2.1 Overview
2.2 Image Processing
2.3 Optical Character Recognition (OCR)
3. DESIGN AND IMPLEMENTATION
3.1 Tools and Technologies
3.2 Data Preprocessing
3.2.1 Dataset
3.2.2 Preprocessing
3.3 Model Architecture and Training
3.4 Application Integration
3.5 Implementation
3.5.1 Code
3.5.2 UI
4. RESULTS AND DISCUSSION
5. CONCLUSION AND FUTURE DEVELOPMENT
6. REFERENCES
PREFACE
To complete this project and report, we would like to express our sincere thanks to our
lecturer, Assoc. Prof. Hoàng Văn Dũng, who directly supported us throughout the project.
He drew on his practical experience to guide us toward the requirements of the chosen
topic, answered our questions, and offered timely suggestions and corrections, which
helped us overcome our shortcomings and finish the work well and on time.
We would also like to thank the lecturers of the High Quality Training Department in
general, and of the Information Technology program in particular, for wholeheartedly
imparting the knowledge that gave us the foundation to carry out this project. We also
thank our classmates for sharing useful information and knowledge that helped us improve
our work.
We carried out the project and wrote this report in a short time, with limited knowledge,
limited experience in implementing software projects, and several technical constraints.
Shortcomings are therefore inevitable, and we look forward to receiving valuable comments
from our lecturers so that we can improve our knowledge and do better next time.
Finally, we respectfully wish all our lecturers good health and continued success in their
careers. Once again, we sincerely thank you.
Hồ Chí Minh, ... May 2025
1. PROJECT DESCRIPTION
1.1 Problem

Recognizing handwritten Vietnamese text is challenging because of diverse writing
styles, complex diacritical marks, and the tone marks carried by Vietnamese characters.
Existing Optical Character Recognition (OCR) tools, such as EasyOCR, achieve low accuracy
on handwritten Vietnamese text, particularly when the handwriting is inconsistent.
There is a clear need for a fast, accurate, and user-friendly system for recognizing
handwritten Vietnamese text, leveraging advanced architectures like Vision Transformer
(ViT) and VGG16 to improve performance.

1.2 Objectives

● Develop a system to recognize and extract handwritten Vietnamese text from images.
● Apply image preprocessing, scene text detection, and OCR to recognize Vietnamese
handwriting.
● Train custom models for Vietnamese handwriting recognition using two
architectures: a Vision Transformer (ViT) encoder and a VGG16 encoder, both
combined with a Language Transformer decoder.
● Deploy a RESTful API built with FastAPI, together with a web interface that lets users
upload images, view detected text regions, and receive recognized text with optional
refinement.
● Improve recognition accuracy by using PhoBERT for text refinement, and explore beam
search for improved inference.

2. THEORETICAL BACKGROUND
2.1 Overview

Optical Character Recognition (OCR) converts images containing text into digital text and
is widely used in document digitization and handwritten text processing. Pattern
recognition supports character classification, which is crucial for handling diverse
handwriting styles. This project combines OCR and pattern recognition using two deep
learning architectures, Vision Transformer (ViT) for global context and VGG16 for local
feature extraction, to address the challenges of Vietnamese handwriting recognition.

2.2 Image Processing

The image processing techniques used include:


●​ Grayscale Conversion: Converts color images to grayscale to reduce computational
complexity.
●​ Thresholding and Noise Removal: Applies thresholding to enhance contrast and
remove noise, improving OCR quality.
●​ Resizing and Normalization: Standardizes images to 224x224 pixels to match ViT
and VGG16 input requirements.
●​ Text Region Detection: Uses EasyOCR to identify text-containing areas in images.
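
The grayscale and thresholding steps listed above can be illustrated without any imaging library. The sketch below is only an illustration: it assumes an image given as nested lists of (R, G, B) tuples, uses the ITU-R BT.601 luma weights for the grayscale conversion, and binarizes with a fixed threshold of 128. The function names and the threshold value are assumptions made for this example and are not taken from the project code.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples into rows of 0-255 luma values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Map each pixel to 255 (paper) or 0 (ink) using a fixed threshold."""
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_image
    ]


# A 2x2 toy image: white paper, black ink, dark blue ink, light gray paper.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(20, 20, 120), (200, 200, 200)],
]
gray = to_grayscale(image)
binary = binarize(gray)
print(gray)
print(binary)
```

A fixed threshold is the simplest choice; an adaptive threshold would be more robust to uneven lighting, at the cost of extra computation.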

2.3 Optical Character Recognition (OCR)

The OCR process includes:

● Preprocessing: Enhances image quality for recognition.
● Text Region Detection: Identifies text areas using EasyOCR.
● Character Recognition: Predicts text from detected regions using custom models.

EasyOCR supports Vietnamese but struggles with handwritten text due to limited
training data. To address this, the project integrates VietOCR, tailored for
Vietnamese text, and trains two custom models:

● ViT-Based Model: Uses a Vision Transformer encoder with a Language
Transformer decoder for global context.
● VGG16-Based Model: Uses a VGG16 convolutional backbone with a Language
Transformer decoder for local feature extraction.

Both models are fine-tuned on a custom Vietnamese handwriting dataset, with PhoBERT
refining recognized text to improve accuracy.

Applications of OCR in This Context: The OCR system developed for Vietnamese
handwriting recognition has targeted applications in specialized domains. For instance, it
can be applied to digitize doctor-handwritten prescriptions or medical notes in Vietnamese,
enabling hospitals to convert illegible handwriting into digital records for better patient
data management and integration with electronic health systems. Similarly, the system can
process handwritten administrative forms in government offices, automating data entry and
reducing errors. In educational settings, it facilitates the transcription of handwritten
student notes or exam scripts, supporting digital grading and archiving processes. These
applications leverage the VietOCR models’ ability to handle complex Vietnamese
diacritical marks and diverse handwriting styles.
3. DESIGN AND IMPLEMENTATION
3.1 Tools and Technologies

Programming Language: Python

Development Framework: FastAPI

Libraries:

●​ EasyOCR: For detecting scene text in images.


●​ VietOCR: For recognizing Vietnamese text, with a custom-trained model.
●​ OpenCV, NumPy, PIL: For image preprocessing (grayscale conversion, contrast
enhancement, sharpness adjustment).
●​ PyTorch, Transformers: For training and deploying deep learning models,
including ViT and Language Transformer.
●​ PhoBERT (vinai/phobert-base): For refining recognized text to improve accuracy.
●​ Pyvi: For Vietnamese text normalization and accent handling.

User Interface: HTML, CSS (Bootstrap), JavaScript for a web interface integrated with
FastAPI.

3.2 Data Preprocessing


3.2.1 Dataset:

The dataset comes from the Cinnamon AI Marathon (2018) and contains about 2,000
samples for the Vietnamese handwriting recognition task. We use 80% of the samples
for training and 20% for validation.
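
The 80/20 split can be sketched as follows. The training code later in this report calls a helper named split_dataset whose body is not shown, so this is a hypothetical in-memory stand-in: the function name split_labels, the assumption that the labels are a mapping from image file name to transcription, and the toy data are all illustrative.

```python
import random


def split_labels(labels, train_ratio=0.8, random_seed=42):
    """Shuffle the sample names reproducibly and split them into two dicts."""
    names = sorted(labels)               # fixed starting order
    rng = random.Random(random_seed)     # local RNG, does not touch global state
    rng.shuffle(names)
    cut = int(len(names) * train_ratio)
    train = {name: labels[name] for name in names[:cut]}
    valid = {name: labels[name] for name in names[cut:]}
    return train, valid


# Toy stand-in for a labels file with about 2,000 entries.
labels = {f"img_{i:04d}.png": f"label {i}" for i in range(2000)}
train, valid = split_labels(labels)
print(len(train), len(valid))
```

Fixing the seed keeps the validation set identical across runs, which matters when two models are compared on the same data.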

image: (sample image not reproduced)
label: (sample label not reproduced)
3.2.2 Preprocessing:

Image Preprocessing:

● Grayscale Conversion: Images are converted to grayscale to reduce computational
complexity.
● Contrast Enhancement: Applied using PIL's ImageEnhance.Contrast with a factor
of 2.0 to improve text visibility.
● Sharpness Adjustment: Enhanced using PIL's ImageEnhance.Sharpness with a
factor of 1.5.
● Resizing: Images are resized to 224x224 pixels to match ViT input requirements.

Data Augmentation: The training code does not implement augmentation (image_aug is
set to False in the training configuration). Future work could incorporate libraries
like imgaug.

3.3 Model Architecture and Training

The system employs two VietOCR architectures, both paired with a Language Transformer
decoder, trained on a custom Vietnamese handwriting dataset:

Vietnamese Character Encoding: The vocabulary includes Vietnamese characters, tonal
marks, special symbols, and control tokens (<pad>, <go>, <eos>, <unk>, <mask>).
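
A character-level vocabulary of this kind can be sketched as below. This is an illustrative stand-in rather than the vocabulary class used in the notebooks: the class name SimpleVocab, the index order of the control tokens, and the NFC normalization step are assumptions. Normalization is included because a Vietnamese letter such as "á" can be stored either as one code point or as a base letter plus a combining mark, and the two forms would otherwise map to different indices.

```python
import unicodedata


class SimpleVocab:
    """Character-level vocabulary with control tokens (illustrative only)."""

    SPECIALS = ["<pad>", "<go>", "<eos>", "<unk>", "<mask>"]

    def __init__(self, chars):
        chars = unicodedata.normalize("NFC", chars)
        self.itos = self.SPECIALS + list(chars)
        self.stoi = {token: index for index, token in enumerate(self.itos)}
        self.pad, self.go, self.eos, self.unk, self.mask = range(5)

    def encode(self, text):
        """Wrap the character indices in <go> ... <eos>; unknowns map to <unk>."""
        text = unicodedata.normalize("NFC", text)
        body = [self.stoi.get(ch, self.unk) for ch in text]
        return [self.go] + body + [self.eos]

    def decode(self, ids):
        """Turn indices back into text, stopping at the first <eos>."""
        chars = []
        for index in ids:
            if index == self.eos:
                break
            if index in (self.pad, self.go):
                continue
            chars.append(self.itos[index])
        return "".join(chars)

    def __len__(self):
        return len(self.itos)


vocab = SimpleVocab("aáàảãạ bcđ")
ids = vocab.encode("đá cá")
print(ids, vocab.decode(ids))
```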

Model Architectures:

Vision Transformer (ViT) Encoder:

● Backbone: google/vit-base-patch16-224, with a linear projection to 512 (or 256)
dimensions.
●​ Decoder: Language Transformer (6 layers, 8 attention heads, positional encoding,
label smoothing loss).
●​ Inference: Beam search (beam size 4, max 128 tokens).
●​ Training (from train-vietocr-vit-transformer.ipynb):
○​ Dataset: Custom dataset, 224x224 RGB images, normalized.
○​ Optimizer: AdamW, learning rate 1e-4, OneCycleLR scheduler.
○​ Parameters: 100 epochs, batch size 8, 10,000 iterations, NVIDIA Tesla T4
GPU.
○​ Loss: Label Smoothing Loss (smoothing 0.1).
○​ Weights: Loaded from transformerocr.pth.
○​ Metrics: Valid loss 1.589, full sequence accuracy 0.0000, per-character
accuracy 0.0754.
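
Beam search, mentioned under Inference, keeps the best few partial transcriptions at every decoding step instead of only the single best one. The sketch below is a generic, framework-free version and not the decoder used by the project: the function name, the next_log_probs callback (which stands in for the Transformer decoder), and the toy model are assumptions, and no length normalization is applied.

```python
import math


def beam_search(next_log_probs, go_id, eos_id, beam_size=4, max_len=128):
    """Return the highest-scoring token sequence found with a fixed-size beam."""
    beams = [([go_id], 0.0)]   # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)     # hypotheses that were cut off at max_len
    if not finished:
        return [go_id]
    return max(finished, key=lambda item: item[1])[0]


GO, EOS, A, B = 1, 2, 4, 5


def toy_model(tokens):
    """Toy next-token distribution in which the greedy first choice is a trap."""
    last = tokens[-1]
    if last == GO:
        return {A: math.log(0.6), B: math.log(0.4)}
    if last == A:
        return {EOS: math.log(0.6), A: math.log(0.4)}
    return {EOS: math.log(0.95), A: math.log(0.05)}


print(beam_search(toy_model, GO, EOS, beam_size=4))
print(beam_search(toy_model, GO, EOS, beam_size=1))
```

With a beam of 1 the search is greedy and commits to token A, reaching a total probability of 0.36; with a beam of 4 it also keeps the path through B, whose total probability of 0.38 is higher.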
VGG16 Encoder:

● Backbone: VGG16 from torchvision.models.vgg16, pre-trained on ImageNet, with
fully connected layers replaced by a linear projection (512 or 256 dimensions).
●​ Decoder: Identical to ViT (6 layers, 8 attention heads, positional encoding, label
smoothing loss).
●​ Inference: Beam search (beam size 4, max 128 tokens).
●​ Training (from train-pretrained-viet-ocr.ipynb and logs):
○​ Dataset: Same custom dataset, 224x224 RGB images, normalized.
○​ Optimizer: AdamW, learning rate decaying from 2.20e-05 (iter 7200) to
4.03e-10 (iter 10,000), OneCycleLR scheduler.
○​ Parameters: 10,000 iterations, batch size 8, ~90 minutes on NVIDIA Tesla
T4 GPU.
○​ Loss: Label Smoothing Loss (smoothing 0.1).
○​ Weights: Fine-tuned from vgg_transformer.pth.
○​ Metrics:
■​ Train Loss: ~1.384–1.393 (final: 1.393 at iter 10,000).
■​ Valid Loss: 1.578–1.595, final 1.589 (iter 10,000).
■​ Full Sequence Accuracy: 0.0000.
■​ Per-Character Accuracy: 0.0728–0.0765, final 0.0754 (iter 10,000).
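
The learning-rate figures quoted above are consistent with a one-cycle schedule with cosine annealing. The sketch below is a simplified pure-Python rendering of such a schedule and not the scheduler object used in training. It assumes max_lr=1e-4 and pct_start=0.1 (the values shown in the training configuration in section 3.5.1) together with PyTorch's documented defaults div_factor=25 and final_div_factor=1e4; the function name is hypothetical.

```python
import math


def one_cycle_lr(step, total_steps=10_000, max_lr=1e-4, pct_start=0.1,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given step of a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1   # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Moves smoothly from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


for step in (0, 999, 7200, 9999):
    print(step, f"{one_cycle_lr(step):.3e}")
```

Under these assumptions the schedule starts at 4e-06, peaks at 1e-04 after the warm-up, is roughly 2.2e-05 at iteration 7,200, and ends near 4e-10, which is close to the logged values.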
Model Selection: Configurable in main.py via transformerocr.pth (ViT) or
vgg_transformer.pth (VGG16).

Text Refinement: PhoBERT (vinai/phobert-base) corrects invalid characters in recognized
text.

3.4 Application Integration

The application is deployed as a RESTful API using FastAPI, with the following features:

●​ Upload Interface: Users can upload JPEG or PNG images via a web interface built
with HTML, CSS (Bootstrap), and JavaScript.
●​ Image Display: Shows the original image and processed image with detected text
regions highlighted in red bounding boxes.
●​ Text Detection and Recognition:
○​ EasyOCR: Detects text regions with configurable parameters (e.g.,
min_size=20, text_threshold=0.4).
○​ VietOCR: Recognizes text in detected regions using either the
ViT-Transformer or VGG16-Transformer model, selected based on
configuration or user input.
○​ PhoBERT Refinement: Optionally refines recognized text to improve
accuracy.
● Options: Users can toggle scene text detection, text refinement, and paragraph detection.
●​ Results: Returns JSON output with recognized text, bounding box coordinates,
cropped text region images (in base64), and the count of detected regions.
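
The flow from detection to the JSON result can be sketched as follows. This is a hypothetical outline and not the project's process_image function: the name build_response, the injected callables, and the response field names (bbox, text, crop_base64, region_count) are assumptions chosen to mirror the description above. The detector, cropper, recognizer, and refiner are passed in as plain functions so the sketch runs without any OCR library.

```python
import base64
import json


def build_response(image, detect, crop, recognize, refine=None):
    """Detect regions, recognize each crop, and assemble a JSON-ready dict."""
    regions = []
    for box in detect(image):
        crop_bytes = crop(image, box)
        text = recognize(crop_bytes)
        if refine is not None:           # refinement is optional
            text = refine(text)
        regions.append({
            "bbox": list(box),
            "text": text,
            "crop_base64": base64.b64encode(crop_bytes).decode("ascii"),
        })
    return {
        "text": "\n".join(region["text"] for region in regions),
        "regions": regions,
        "region_count": len(regions),
    }


# Stand-ins for the real detector, cropper, recognizer, and refiner.
fake_image = b"raw image bytes"
response = build_response(
    fake_image,
    detect=lambda img: [(0, 0, 10, 5), (0, 6, 10, 12)],
    crop=lambda img, box: img[:4] + bytes(box),
    recognize=lambda crop_bytes: " xin chào ",
    refine=str.strip,
)
print(json.dumps(response, ensure_ascii=False, indent=2))
```

Encoding the crops as base64 text keeps the whole result in one JSON document, so the web page can display each region without a second request.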

3.5 Implementation:
3.5.1 Code

Purpose Code Snippet File/Notebook

Purpose: VGG16 training (sample)
File/Notebook: train-pretrained-viet-ocr.ipynb

    import os
    import matplotlib.pyplot as plt
    from vietocr.tool.predictor import Predictor
    from vietocr.tool.config import Cfg
    from vietocr.tool.translate import build_model
    from vietocr.model.trainer import Trainer

    # Load configuration and train
    config = Cfg.load_config_from_file('/kaggle/working/custom_vgg_transformer.yml')
    trainer = Trainer(config)
    trainer.train()

    # Evaluate
    trainer.visualize_prediction()
    trainer.precision()

Purpose: ViT training (sample)
File/Notebook: train-vietocr-vit-transformer.ipynb

    import torch
    from torchvision import transforms

    # create_vit_config, split_dataset, Vocab, CustomDataset, VietOCR and Trainer
    # are not shown in this excerpt.

    def train():
        # Configuration
        config = create_vit_config()
        config.update({
            'device': 'cuda' if torch.cuda.is_available() else 'cpu',
            'dataset': {
                'data_root': '/kaggle/input/dataset-ocr/dataset/data',
                'train_annotation': None,  # Will be set after splitting
                'valid_annotation': None,  # Will be set after splitting
                'name': 'ocr_dataset',
                'image_height': 224,
                'image_min_width': 224,
                'image_max_width': 224
            },
            'aug': {
                'image_aug': False,
                'masked_language_model': False
            },
            'optimizer': {
                'max_lr': 1e-4,
                'pct_start': 0.1,
                'anneal_strategy': 'cos'
            },
            'dataloader': {
                'num_workers': 1,
                'pin_memory': True
            },
            'predictor': {
                'beamsearch': True
            },
            'pretrain': '/kaggle/input/vietocr/pytorch/default/1/vgg_transformer.pth',
            'quiet': False
        })

        # Split the dataset
        original_labels_file = '/kaggle/input/dataset-ocr/dataset/labels.json'
        output_dir = './dataset_splits'
        train_json, valid_json = split_dataset(
            original_labels_file, output_dir, train_ratio=0.8, random_seed=42)

        # Update config with new annotation files
        config['dataset']['train_annotation'] = train_json
        config['dataset']['valid_annotation'] = valid_json

        # Create vocabulary (the quote and the backslash are escaped for Python)
        vocab = (
            'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬb'
            'BcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉ'
            'ỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộ'
            'ỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯ'
            'ừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ01234'
            '56789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '
        )
        vocab = Vocab(vocab)
        config['transformer']['vocab_size'] = len(vocab)

        # Define image transforms
        transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

        # Create datasets
        train_dataset = CustomDataset(
            data_dir=config['dataset']['data_root'],
            labels_file=config['dataset']['train_annotation'],
            vocab=vocab,
            transform=transform
        )
        valid_dataset = CustomDataset(
            data_dir=config['dataset']['data_root'],
            labels_file=config['dataset']['valid_annotation'],
            vocab=vocab,
            transform=transform
        )

        # Create model
        model = VietOCR(
            vocab_size=len(vocab),
            backbone=None,
            vit_args=config,
            transformer_args=config['transformer'],
            seq_modeling='transformer'
        )

        # Create trainer
        trainer = Trainer(
            config=config,
            model=model,
            vocab=vocab,
            train_dataset=train_dataset,
            valid_dataset=valid_dataset,
            pretrained=True
        )

        # Train model
        trainer.train()

Purpose: Inference (ViT/VGG16)
File/Notebook: main.py

    from PIL import Image
    from vietocr.tool.predictor import Predictor
    from vietocr.tool.config import Cfg

    config = Cfg.load_config_from_name('vgg_transformer')  # or 'transformer'
    config['weights'] = './weights/vgg_transformer.pth'  # or 'transformerocr.pth'
    config['device'] = 'cpu'
    predictor = Predictor(config)
    img = Image.open('sample.jpg')
    text = predictor.predict(img)
    print(text)

Purpose: Refine
File/Notebook: main.py

    import logging

    # phobert_pipeline and is_valid_vietnamese are defined elsewhere in main.py
    # (not shown in this excerpt). The threshold argument is not used below.
    def refine_text_with_phobert(text, threshold=0.3):
        try:
            if not text.strip():
                return text

            # First pass: let PhoBERT complete the whole sequence
            masked_text = f"{text} <mask>"
            predictions = phobert_pipeline(masked_text, top_k=5)
            for pred in predictions:
                candidate = pred["sequence"].replace("<s>", "").replace("</s>", "").strip()
                if is_valid_vietnamese(candidate):
                    return candidate

            # Second pass: mask and replace each invalid word in turn
            words = text.split()
            refined_words = []
            for i, word in enumerate(words):
                if is_valid_vietnamese(word):
                    refined_words.append(word)
                    continue
                masked_word = " ".join([w if j != i else "<mask>" for j, w in enumerate(words)])
                predictions = phobert_pipeline(masked_word, top_k=3)
                for pred in predictions:
                    candidate = pred["token_str"]
                    if is_valid_vietnamese(candidate):
                        refined_words.append(candidate)
                        break
                else:
                    # No valid candidate found: keep the original word
                    refined_words.append(word)
            return " ".join(refined_words)
        except Exception as e:
            logging.error(f"Refinement error: {str(e)}")
            return text

Purpose: Detect scene text
File/Notebook: main.py

    import logging
    import numpy as np

    # reader is the EasyOCR reader created elsewhere in main.py (not shown).
    def detect_text_regions(image, detect_paragraph=True, min_confidence=0.5):
        try:
            results = reader.readtext(
                np.array(image),
                decoder='wordbeamsearch',
                min_size=20,
                text_threshold=0.4,
                link_threshold=0.4,
                width_ths=0.7,
                paragraph=detect_paragraph
            )
            boxes = []
            for detection in results:
                if len(detection) == 3:
                    bbox, text, conf = detection
                    if conf < min_confidence:
                        continue  # skip low-confidence detections
                else:
                    bbox, text = detection
                    conf = None  # fallback if confidence not returned
                # Convert the four-point polygon to an axis-aligned box
                x_coords = [point[0] for point in bbox]
                y_coords = [point[1] for point in bbox]
                boxes.append((
                    int(min(x_coords)), int(min(y_coords)),
                    int(max(x_coords)), int(max(y_coords))
                ))
            return boxes
        except Exception as e:
            logging.error(f"Error during text region detection: {str(e)}")
            return []

Purpose: Preprocessing
File/Notebook: main.py

    from PIL import ImageEnhance

    def preprocess_image(image):
        img = image.convert('L')                 # grayscale
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(2.0)              # contrast factor 2.0
        enhancer = ImageEnhance.Sharpness(img)
        img = enhancer.enhance(1.5)              # sharpness factor 1.5
        return img.convert('RGB')

Purpose: Detect text endpoint
File/Notebook: main.py

    import logging
    from io import BytesIO

    from fastapi import File, Form, HTTPException, Query, UploadFile
    from PIL import Image

    # process_image is defined elsewhere in main.py (not shown in this excerpt).
    async def detect_text(
        file: UploadFile = File(...),
        use_scene_text_detection: bool = Query(True),
        refine: bool = Form(True),
        detect_paragraph: bool = Query(True)
    ):
        if file.content_type not in ["image/jpeg", "image/png", "image/jpg"]:
            raise HTTPException(400, detail="Invalid image format. Please upload a JPEG or PNG file.")
        try:
            contents = await file.read()
            image = Image.open(BytesIO(contents)).convert('RGB')
            result = process_image(image, use_scene_text_detection, refine, detect_paragraph)
            return result
        except Exception as e:
            import traceback
            logging.error(f"Error during /detect/: {traceback.format_exc()}")
            raise HTTPException(500, detail=f"An error occurred during processing: {str(e)}")

3.5.2 UI

● Introduction page: (screenshot not reproduced)
● Main page: (screenshot not reproduced)

4. RESULTS AND DISCUSSION

The system was tested on handwritten Vietnamese text images, using both ViT and VGG16
architectures:

● EasyOCR: Effectively detected text regions but struggled with Vietnamese
character recognition due to limited training on Vietnamese handwriting.
●​ VietOCR with ViT: Trained on a custom dataset, but performance was limited:
○​ Full Sequence Accuracy: 0.0000 (no sequences perfectly predicted).
○​ Per-Character Accuracy: 0.0754 (7.54% of characters correct).
○​ These metrics (from train-vietocr-vit-transformer.ipynb) indicate challenges
with dataset diversity or model convergence.
● VietOCR with VGG16: Fine-tuned on the same dataset using pre-trained weights
(vgg_transformer.pth). VGG16's convolutional architecture is well suited to local
feature extraction, and the final evaluation listed under Performances gives a
full sequence accuracy of 0.3608 and a per-character accuracy of 0.8627, well
above the ViT model.
●​ PhoBERT Refinement: Improved text quality for both models by correcting invalid
characters and enhancing linguistic coherence.
●​ Web Interface: Delivered a responsive, user-friendly experience with real-time
visualization, leveraging Bootstrap for styling.
● Model Comparison:
○ ViT: Captures global context but struggled with the limited data and a bug in
validation, as shown by its low metrics.
○ VGG16: Better at local features and suited to consistent handwriting; the
figures under Performances support this advantage.
● Challenges:
○ ViT's low accuracy suggests insufficient dataset diversity or training
duration.
○ Earlier VGG16 logs showed poor performance, indicating the need for
careful fine-tuning.
● Performances:
○ VGG16:

acc full seq: 0.3608 - acc per char: 0.8627

Classification Report (Character-level):

precision   recall   f1-score
0.21        0.21     0.21

Confusion Matrix (Character-level): (figure not reproduced)

○ ViT:

acc full seq: 0.0000 - acc per char: 0.0754

●​ Comment:

Based on these metrics (precision, recall, f1-score, confusion matrix): with only a
limited dataset of handwritten Vietnamese, the VietOCR model predicts words made up of
plain Latin letters well, but its performance on words with Vietnamese diacritics
remains low. In addition, the model uses VGG16 for feature extraction, and VGG16
focuses on local rather than global features, so recognition of long sentences is
still limited.
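
The two accuracy figures reported in this section can be computed along the following lines. The definitions are assumptions for illustration: full sequence accuracy is taken as the share of samples predicted exactly, and per-character accuracy as the position-by-position match rate averaged over samples. The function names are hypothetical, and the training notebooks may define the metrics differently. Strings are normalized to NFC first so that the same accented letter stored in two Unicode forms is not counted as an error.

```python
import unicodedata


def _nfc(text):
    return unicodedata.normalize("NFC", text)


def full_sequence_accuracy(labels, predictions):
    """Fraction of samples whose prediction matches the label exactly."""
    correct = sum(1 for y, p in zip(labels, predictions) if _nfc(y) == _nfc(p))
    return correct / len(labels)


def per_char_accuracy(labels, predictions):
    """Average over samples of the share of label characters matched in place."""
    total = 0.0
    for y, p in zip(labels, predictions):
        y, p = _nfc(y), _nfc(p)
        matches = sum(1 for a, b in zip(y, p) if a == b)
        total += matches / max(len(y), 1)
    return total / len(labels)


labels = ["xin chào", "cảm ơn"]
predictions = ["xin chao", "cảm ơn"]   # one missing tone mark in the first sample
print(full_sequence_accuracy(labels, predictions))
print(per_char_accuracy(labels, predictions))
```

A single missing tone mark makes the whole first sequence count as wrong while costing only one character in eight, which is why full sequence accuracy can sit far below per-character accuracy.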

5. CONCLUSION AND FUTURE DEVELOPMENT

The system leverages EasyOCR, VietOCR (ViT and VGG16), and PhoBERT for
Vietnamese handwriting recognition. VGG16 significantly outperformed ViT (full
sequence accuracy 0.3608 vs. 0.0000, per-character accuracy 0.8627 vs. 0.0754),
demonstrating the effectiveness of convolutional architectures for this task. The FastAPI
web interface provided a user-friendly experience.

Future Improvements:

● Enhance dataset with more diverse handwriting samples to improve ViT
performance.
●​ Further fine-tune VGG16 and explore hybrid ViT-VGG16 models.
●​ Implement T5-small for text summarization.
●​ Add webcam support for real-time recognition.
●​ Optimize models for edge devices via quantization.

6. REFERENCES

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... &
Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition
at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929

Gonzalez, R. C., & Woods, R. E. (2018). Digital image processing (4th ed.). Pearson.

Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for
Vietnamese. Findings of the Association for Computational Linguistics: EMNLP 2020,
1037–1042. https://doi.org/10.18653/v1/2020.findings-emnlp.92

Nguyen, T. (2020). VietOCR: A simple and effective library for Vietnamese optical
character recognition. GitHub Repository. https://github.com/pbcquoc/vietocr

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
https://doi.org/10.48550/arXiv.1409.1556

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &
Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information
Processing Systems, 30, 5998–6008.
https://papers.nips.cc/paper/7181-attention-is-all-you-need

Jaided AI. (2020). EasyOCR: Ready-to-use OCR with 80+ supported languages. GitHub
Repository. https://github.com/JaidedAI/EasyOCR

PyTorch. (2023). PyTorch documentation. https://pytorch.org/docs/stable/index.html

torchvision. (2023). Torchvision documentation.
https://pytorch.org/vision/stable/index.html

FastAPI. (2023). FastAPI documentation. https://fastapi.tiangolo.com/

Hugging Face. (2023). Transformers: State-of-the-art natural language processing.
https://huggingface.co/docs/transformers/index
