Package ‘tesseract’

March 23, 2025


Type Package
Title Open Source OCR Engine
Version 5.2.3
Description Bindings to 'Tesseract':
a powerful optical character recognition (OCR) engine that supports over 100 languages.
The engine is highly configurable in order to tune the detection algorithms and
obtain the best possible results.
License Apache License 2.0

URL https://docs.ropensci.org/tesseract/
https://ropensci.r-universe.dev/tesseract

BugReports https://github.com/ropensci/tesseract/issues
SystemRequirements Tesseract >= 3.03 (libtesseract-dev /
tesseract-devel) and Leptonica (libleptonica-dev /
leptonica-devel). On Debian you need to install the English
training data separately (tesseract-ocr-eng)
Imports Rcpp (>= 0.12.12), pdftools (>= 1.5), curl, rappdirs, digest
LinkingTo Rcpp
RoxygenNote 7.3.2
Suggests magick (>= 1.7), spelling, knitr, tibble, rmarkdown
Encoding UTF-8
VignetteBuilder knitr
Language en-US
NeedsCompilation yes
Author Jeroen Ooms [aut, cre] (<https://orcid.org/0000-0002-4035-0289>)
Maintainer Jeroen Ooms <[email protected]>
Repository CRAN
Date/Publication 2025-03-23 14:50:01 UTC


Contents
ocr
tesseract
tesseract_download

ocr Tesseract OCR

Description

Extract text from an image. Requires that you have training data for the language you are reading.
Works best for images with high contrast, little noise and horizontal text. See tesseract wiki and our
package vignette for image preprocessing tips.

Usage

ocr(image, engine = tesseract("eng"), HOCR = FALSE)

ocr_data(image, engine = tesseract("eng"))

Arguments

image file path, url, or raw vector to image (png, tiff, jpeg, etc)
engine a tesseract engine created with tesseract(). Alternatively a language string
which will be passed to tesseract().
HOCR if TRUE return results as HOCR xml instead of plain text

Details

The ocr() function returns plain text by default, or hOCR text if HOCR is set to TRUE. The
ocr_data() function returns a data frame with a confidence rate and bounding box for each word
in the text.
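
The per-word output of ocr_data() lends itself to simple post-processing. The sketch below is illustrative only: it assumes the result has columns named word, confidence and bbox, with bbox stored as a comma-separated "x1,y1,x2,y2" string. The helpers split_bbox() and confident_words() and the sample data are hypothetical and not part of the package.

```r
# Hypothetical helper: keep only words at or above a confidence threshold.
confident_words <- function(df, min_conf = 80) {
  df[df$confidence >= min_conf, , drop = FALSE]
}

# Hypothetical helper: split "x1,y1,x2,y2" bbox strings into numeric columns.
# Assumes at least one row and exactly four coordinates per bbox.
split_bbox <- function(df) {
  parts <- strsplit(as.character(df$bbox), ",", fixed = TRUE)
  coords <- do.call(rbind, lapply(parts, as.numeric))
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  cbind(df[, c("word", "confidence")], as.data.frame(coords))
}

# Hand-built stand-in for ocr_data() output (assumed layout), so the
# sketch runs without an OCR engine or network access.
fake_ocr <- data.frame(
  word = c("This", "is", "n0isy"),
  confidence = c(96.5, 95.1, 41.2),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,201,116"),
  stringsAsFactors = FALSE
)

result <- split_bbox(confident_words(fake_ocr))
print(result)
```

With the package installed, the same helpers could be applied to real output, for example split_bbox(confident_words(ocr_data(image))). The threshold of 80 is an arbitrary starting point, not a recommended value.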

References

Tesseract: Improving Quality

See Also

Other tesseract: tesseract(), tesseract_download()



Examples
# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

# Same image, returned as hOCR XML instead of plain text
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)

# One row per word, with confidence and bounding box
df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)

# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")

# Extract text from the rendered image
text <- ocr(img_file)
unlink(img_file)
cat(text)

# Engine restricted to recognizing digits only
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))

tesseract Tesseract Engine

Description
Create an OCR engine for a given language and control parameters. This can be used by the ocr
and ocr_data functions to recognize text.

Usage
tesseract(
language = "eng",
datapath = NULL,
configs = NULL,
options = NULL,
cache = TRUE
)

tesseract_params(filter = "")

tesseract_info()

Arguments
language string with language for training data. Usually defaults to eng
datapath path with the training data for this language. Default uses the system library.
configs character vector with files, each containing one or more parameter values. These
config files can exist in the current directory or one of the standard tesseract
config files that live in the tessdata directory. See details.
options a named list with tesseract parameters. See details.
cache speed things up by caching engines
filter only list parameters containing a particular string

Details
Tesseract control parameters can be set either via a named list in the options parameter, or in a
plain-text config file which contains the parameter name followed by a space and then the value, one
per line. Use tesseract_params() to list or find parameters. Note that some parameters are
only supported in certain versions of libtesseract, and that invalid parameters can sometimes cause
libtesseract to crash.
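
The config-file format described above can be sketched as follows. The helper params_to_config() and the file name digits.config are hypothetical and only illustrate the "name, space, value, one per line" layout. The two parameter names are ordinary Tesseract parameters chosen for illustration; availability may depend on the libtesseract version.

```r
# Hypothetical helper: render a named list of parameters as config-file
# lines, one "name value" pair per line.
params_to_config <- function(params) {
  stopifnot(is.list(params), !is.null(names(params)), all(nzchar(names(params))))
  paste(names(params), vapply(params, as.character, character(1)))
}

# The same named list could be passed directly as the options argument.
params <- list(
  tessedit_char_whitelist = "0123456789",
  preserve_interword_spaces = "1"
)

# Write the equivalent config file to a temporary location.
config_file <- file.path(tempdir(), "digits.config")
writeLines(params_to_config(params), config_file)

cat(readLines(config_file), sep = "\n")
```

Under these assumptions, tesseract(options = params) and tesseract(configs = config_file) would be two routes to the same settings. Checking names against tesseract_params() first is a reasonable precaution, given that invalid parameters can crash libtesseract.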

See Also
Other tesseract: ocr(), tesseract_download()

Examples
tesseract_params('debug')

tesseract_download Tesseract Training Data

Description
Helper function to download training data from the official tessdata repository. On Linux, the fast
training data can be installed directly with yum or apt-get.

Usage
tesseract_download(
lang,
datapath = NULL,
model = c("fast", "best"),
progress = interactive()
)

Arguments
lang three letter code for language, see tessdata repository.
datapath destination directory in which to store the downloaded file
model either fast or best; these are the only values currently supported. The latter downloads more accurate
(but slower) trained models for Tesseract 4.0 or higher
progress print progress while downloading

Details
Tesseract uses training data to perform OCR. Most systems default to English training data. To
improve OCR performance for other languages you can install the training data from your
distribution. For example, to install the Spanish training data:

• tesseract-ocr-spa (Debian, Ubuntu)
• tesseract-langpack-spa (Fedora, EPEL)

On Windows and macOS you can install languages using the tesseract_download function, which
downloads training data directly from GitHub and stores it in the path on disk given by the
TESSDATA_PREFIX variable.
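
The platform split described above can be summarized in a small helper. This is a sketch only: training_data_hint() is a hypothetical function, and it assumes the distribution package names follow the tesseract-ocr-<lang> and tesseract-langpack-<lang> patterns shown for Spanish.

```r
# Hypothetical helper: suggest how to obtain training data for a language,
# based on the operating system name reported by Sys.info().
training_data_hint <- function(lang, sysname = Sys.info()[["sysname"]]) {
  stopifnot(is.character(lang), length(lang) == 1, nzchar(lang))
  if (identical(sysname, "Linux")) {
    # Package name patterns assumed from the Spanish example
    sprintf(
      "install 'tesseract-ocr-%s' (Debian, Ubuntu) or 'tesseract-langpack-%s' (Fedora, EPEL)",
      lang, lang
    )
  } else {
    # Windows and macOS: download into the TESSDATA_PREFIX location
    sprintf("run tesseract_download(\"%s\")", lang)
  }
}

cat(training_data_hint("spa", sysname = "Linux"), "\n")
cat(training_data_hint("spa", sysname = "Windows"), "\n")
```

The helper only prints a suggestion and never installs anything, which keeps it safe to run on any platform.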

References
tesseract wiki: training data

See Also
Other tesseract: ocr(), tesseract()

Examples
## Not run:
if(is.na(match("fra", tesseract_info()$available)))
  tesseract_download("fra", model = 'best')
french <- tesseract("fra")
text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
cat(text)

## End(Not run)