Package ‘tesseract’

March 23, 2025

Type Package
Title Open Source OCR Engine
Version 5.2.3
Description Bindings to 'Tesseract':
a powerful optical character recognition (OCR) engine that supports over 100 languages.
The engine is highly configurable in order to tune the detection algorithms and
obtain the best possible results.
License Apache License 2.0

URL https://docs.ropensci.org/tesseract/
https://ropensci.r-universe.dev/tesseract

BugReports https://github.com/ropensci/tesseract/issues
SystemRequirements Tesseract >= 3.03 (libtesseract-dev /
tesseract-devel) and Leptonica (libleptonica-dev /
leptonica-devel). On Debian you need to install the English
training data separately (tesseract-ocr-eng)
Imports Rcpp (>= 0.12.12), pdftools (>= 1.5), curl, rappdirs, digest
LinkingTo Rcpp
RoxygenNote 7.3.2
Suggests magick (>= 1.7), spelling, knitr, tibble, rmarkdown
Encoding UTF-8
VignetteBuilder knitr
Language en-US
NeedsCompilation yes
Author Jeroen Ooms [aut, cre] (<https://orcid.org/0000-0002-4035-0289>)
Maintainer Jeroen Ooms <[email protected]>
Repository CRAN
Date/Publication 2025-03-23 14:50:01 UTC

Contents
ocr
tesseract
tesseract_download

Index

ocr Tesseract OCR

Description

Extract text from an image. Requires that you have training data for the language you are reading.
Works best for images with high contrast, little noise and horizontal text. See tesseract wiki and our
package vignette for image preprocessing tips.

Usage

ocr(image, engine = tesseract("eng"), HOCR = FALSE)

ocr_data(image, engine = tesseract("eng"))

Arguments

image file path, url, or raw vector to image (png, tiff, jpeg, etc)
engine a tesseract engine created with tesseract(). Alternatively a language string
which will be passed to tesseract().
HOCR if TRUE return results as HOCR xml instead of plain text

Details

The ocr() function returns plain text by default, or hOCR XML if HOCR is set to TRUE. The
ocr_data() function returns a data frame with a confidence rate and bounding box for each word
in the text.
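
As a rough sketch of how the ocr_data() output might be post-processed, the helper below drops words under a confidence threshold and splits the bounding box into numeric columns. It assumes the data frame has columns named word, confidence and bbox, with bbox stored as a comma-separated "x1,y1,x2,y2" string. The function name filter_words, the threshold of 80 and the sample values are illustrative only and are not part of the package.

```r
# Hypothetical helper, not part of the tesseract package.
# Assumes `df` has columns word, confidence, and bbox ("x1,y1,x2,y2").
filter_words <- function(df, min_conf = 80) {
  keep <- df[df$confidence >= min_conf, , drop = FALSE]
  # Split each bbox string into four integer coordinates
  parts <- strsplit(as.character(keep$bbox), ",", fixed = TRUE)
  coords <- do.call(rbind, lapply(parts, as.integer))
  # do.call(rbind, list()) yields NULL when no words survive the filter
  if (is.null(coords)) coords <- matrix(integer(0), ncol = 4)
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  cbind(keep[, c("word", "confidence"), drop = FALSE], coords)
}

# Stand-in data shaped like the assumed ocr_data() output; values are made up.
words <- data.frame(
  word = c("This", "is", "n0isy"),
  confidence = c(96.5, 95.1, 41.2),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,200,116"),
  stringsAsFactors = FALSE
)
clean <- filter_words(words)
print(clean)
```

With the package installed, the same helper could be applied to a real result, for example filter_words(ocr_data(image)), provided the column names match the assumption above.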

References

Tesseract: Improving Quality

See Also

Other tesseract: tesseract(), tesseract_download()

Examples
# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)

df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)

# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")

# Extract text from the TIFF image
text <- ocr(img_file)
unlink(img_file)
cat(text)

# Create an engine that only recognizes digits
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))

tesseract Tesseract Engine

Description
Create an OCR engine for a given language and control parameters. This can be used by the ocr
and ocr_data functions to recognize text.

Usage
tesseract(
language = "eng",
datapath = NULL,
configs = NULL,
options = NULL,
cache = TRUE
)

tesseract_params(filter = "")

tesseract_info()

Arguments
language string with language for training data. Usually defaults to eng
datapath path with the training data for this language. Default uses the system library.
configs character vector with files, each containing one or more parameter values. These
config files can exist in the current directory or one of the standard tesseract
config files that live in the tessdata directory. See details.
options a named list with tesseract parameters. See details.
cache speed things up by caching engines
filter only list parameters containing a particular string

Details
Tesseract control parameters can be set either via a named list in the options parameter, or in a
config text file that contains the parameter name followed by a space and then the value, one
per line. Use tesseract_params() to list or find parameters. Note that some parameters are
only supported in certain versions of libtesseract, and that invalid parameters can sometimes cause
libtesseract to crash.
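
The config file layout described above is simple enough to generate from R. The sketch below writes a named list of parameters to a temporary file, one "name value" pair per line. write_tess_config is a hypothetical helper, not a package function, and the parameter shown is the one used in the ocr examples. Whether a given parameter is accepted still depends on the installed libtesseract version.

```r
# Hypothetical helper, not part of the tesseract package.
# Writes one "name value" pair per line, the layout described in Details.
write_tess_config <- function(params, path = tempfile(fileext = ".txt")) {
  stopifnot(is.list(params), !is.null(names(params)), all(nzchar(names(params))))
  lines <- paste(names(params), vapply(params, as.character, character(1)))
  writeLines(lines, path)
  path
}

cfg <- write_tess_config(list(tessedit_char_whitelist = "0123456789"))
cat(readLines(cfg), sep = "\n")

# With the package and English training data installed, the file could
# then be passed via the configs argument (left commented out here):
# engine <- tesseract::tesseract(configs = cfg)
# The same setting can also be given directly as a named list:
# engine <- tesseract::tesseract(options = list(tessedit_char_whitelist = "0123456789"))
```

For one or two parameters the options list is usually the shorter route; a config file may be more convenient when the same settings are shared across scripts.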

See Also
Other tesseract: ocr(), tesseract_download()

Examples
tesseract_params('debug')

tesseract_download Tesseract Training Data

Description
Helper function to download training data from the official tessdata repository. On Linux, the fast
training data can be installed directly with yum or apt-get.

Usage
tesseract_download(
lang,
datapath = NULL,
model = c("fast", "best"),
progress = interactive()
)

Arguments
lang three letter code for language, see tessdata repository.
datapath destination directory in which to store the downloaded file
model either fast or best. The latter downloads more accurate
(but slower) trained models for Tesseract 4.0 or higher
progress print progress while downloading

Details
Tesseract uses training data to perform OCR. Most systems default to English training data. To
improve OCR performance for other languages you can install the training data from your
distribution. For example, to install the Spanish training data:

• tesseract-ocr-spa (Debian, Ubuntu)

• tesseract-langpack-spa (Fedora, EPEL)

On Windows and macOS you can install languages using the tesseract_download function, which
downloads training data directly from GitHub and stores it in the path on disk given by the
TESSDATA_PREFIX variable.

References
tesseract wiki: training data

See Also
Other tesseract: ocr(), tesseract()

Examples
## Not run:
if(is.na(match("fra", tesseract_info()$available)))
  tesseract_download("fra", model = 'best')
french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)

## End(Not run)
Index

∗ tesseract
ocr
tesseract
tesseract_download

ocr
ocr_data (ocr)

tessdata (tesseract_download)
tesseract
tesseract_download
tesseract_info (tesseract)
tesseract_params (tesseract)
