Coptic optical character recognition (OCR) / printed character recognition (PCR)

abanoubha/coptic-ocr

Coptic OCR

Coptic is the final stage of the ancient Egyptian language.

Resources

How to add support for Coptic in Tesseract OCR?

I asked Phind/ChatGPT, and here is the answer.

To add support for Coptic in Tesseract OCR, you would need to train Tesseract with a dataset that includes Coptic script. However, as of the time of writing, Tesseract does not officially support the Coptic script.

To train Tesseract with a new language, you would need to follow these steps:

  1. Prepare your training data: You need a dataset that includes Coptic text and corresponding images. The dataset should be large enough and representative of the text you want Tesseract to recognize. The text should be in UTF-8 format and the images should be clear and high-resolution.
  2. Train Tesseract with the new language: Tesseract can be trained to recognize new languages with its training tools. The tesstrain.sh script trains on synthetic images rendered from a UTF-8 training text with Unicode fonts. Training from scanned images and their transcriptions is supported through the tesstrain Makefile.
  3. Generate a new traineddata file: After training, a new .traineddata file will be generated. This file contains the trained model and should be placed in the tessdata directory of your Tesseract installation.
  4. Use the new traineddata file: You can now use the new traineddata file to recognize Coptic text. You would need to specify the Coptic language code (e.g., 'cop') when running Tesseract.
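
Steps 1 and 2 can be sketched in shell as follows. This is a minimal sketch, not this repo's actual workflow: it assumes a local checkout of the tesstrain repository and a ground-truth directory where each line image (`.png` or `.tif`) sits next to a `.gt.txt` transcription with the same base name. The function names `check_ground_truth` and `train_coptic` and both path arguments are hypothetical. `MODEL_NAME` and `GROUND_TRUTH_DIR` are variables of the tesstrain Makefile.

```shell
# Sketch only: function names and directory arguments are assumptions.

# Print every line image in a ground-truth directory that has no non-empty
# .gt.txt transcription next to it; return non-zero if any pair is incomplete.
check_ground_truth() {
  local dir="$1"
  local missing=0
  local img
  for img in "$dir"/*.png "$dir"/*.tif; do
    [ -e "$img" ] || continue            # skip glob patterns that matched nothing
    if [ ! -s "${img%.*}.gt.txt" ]; then
      echo "missing transcription: $img"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Run tesstrain's "training" target for a model named "cop".
# $1 = path to a tesstrain checkout (hypothetical)
# $2 = absolute path to the ground-truth directory
train_coptic() {
  local tesstrain_dir="$1"
  local gt_dir="$2"
  check_ground_truth "$gt_dir" || return 1
  make -C "$tesstrain_dir" training MODEL_NAME=cop GROUND_TRUTH_DIR="$gt_dir"
}
```

A call might look like `train_coptic ~/src/tesstrain /data/coptic-ground-truth`, where both paths are placeholders. Checking the pairs first avoids starting a long training run on incomplete data.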

Here is an example of how to use the new traineddata file:

tesseract image.png output -l cop
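
The command above assumes that `cop.traineddata` is already in the default tessdata directory. When the model lives somewhere else, a small wrapper can point Tesseract at it with the `--tessdata-dir` option. The wrapper below is a sketch: the function name `run_coptic_ocr` and its argument order are illustrative, not part of Tesseract.

```shell
# Sketch only: function and argument names are illustrative.

# Recognize one image with a custom Coptic model.
# $1 = directory holding cop.traineddata
# $2 = input image
# $3 = output base name (Tesseract writes "$3.txt")
run_coptic_ocr() {
  local tessdata_dir="$1"
  if [ ! -f "$tessdata_dir/cop.traineddata" ]; then
    echo "cop.traineddata not found in $tessdata_dir" >&2
    return 1
  fi
  # --tessdata-dir selects a non-default model directory; -l picks the model.
  tesseract --tessdata-dir "$tessdata_dir" "$2" "$3" -l cop
}
```

For example, `run_coptic_ocr ./models image.png output` would write the recognized text to `output.txt`, with `./models` as a placeholder path. Running `tesseract --list-langs` shows which models Tesseract can find in its default tessdata directory.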

Please note that the training process can be time-consuming and requires significant computational resources. Also, the quality of the OCR results depends on the quality of the training data and the training process.

If you are not able to train Tesseract with Coptic text, you might consider using a different OCR tool that supports Coptic, or requesting that the Tesseract community add support for Coptic.

About this repo

This is an experiment to create an acceptable Coptic OCR model. The main repo for Coptic OCR is here
