
Extracting Text From Images

The document compares 10 free optical character recognition (OCR) tools, including both online services and desktop software. It discusses the pros and cons of online versus desktop OCR, then reviews several specific free online OCR services and desktop OCR programs, assessing their input and output capabilities as well as language support. The author's recommended online service is due to its accuracy and language support, while noting its limited free page capacity.

Uploaded by

Lance David
Copyright
© Attribution Non-Commercial (BY-NC)

How to extract text from images: a comparison of 10 free OCR tools

freewaregenius.com

Printing text to paper is done every day; on some occasions, however, the reverse is needed: getting the original text back from a scanned image or photograph for further editing and use. This conversion is called Optical Character Recognition, or OCR for short. It can convert scanned books and documents into editable text, recover editable text from PDFs created by scanning, or even extract text from screenshots and images. There are a variety of tools available for character recognition, and some of them are free to use. This article will help you find and choose between several free OCR tools.

Online OCR services vs. desktop OCR software


Selecting the right OCR tool depends on your specific needs. Generally, OCR tools can be divided into two groups, online services and desktop software, and both have their positive and negative sides. Online services require you to upload your files over the internet to their servers, so there may be privacy concerns as well as time and bandwidth concerns if your document is big. Most limit the file size and the number of pages they will process for free per day or week; for bigger jobs you have to buy extra processing capacity. On the flip side, many of these services are really good at the OCR itself.

With desktop software you don't need to worry about uploading sensitive information to foreign servers, or whether your file will take too long to upload. Desktop programs generally give better text review options, and some offer integration with scanner software.

A note on comparing OCR software: OCR programs are not mainstream applications, so there is only a limited number of freeware titles available, unlike, for example, media players or file managers. In this article we aimed to provide a complete list of the titles we found and evaluated at the time of writing. OCR results tend to vary; the accuracy of different OCR solutions depends on the quality, file format and fonts used in the source documents. For instance, some programs give better results with typewriter fonts and worse results with screen fonts, whereas other programs perform exactly the opposite. We therefore shied away from a head-to-head comparison of OCR accuracy in this article, as the rating could be unjust for the specific files you might need to process. Some general information about getting good OCR results is given at the end of this article.

We reviewed the following online OCR services and desktop OCR programs, all of which are either FREE or have a free component.

Online OCR services: Google Docs, Free Online OCR, i2OCR, OCRonline, Online OCR
Desktop software: Cuneiform OpenOCR, FreeOCR, gImageReader, Puma.NET, SimpleOCR


Part 1: Online OCR software


Online OCR software is available through the web browser, so you don't have to install new software on your computer. All you need to do is capture the image file with a scanner or a digital camera, upload it through the online OCR web page and wait for the processed file to download.

If you have a Gmail or other Google account you might try Google Docs first. Google Docs is not a dedicated OCR tool, but it provides the OCR power Google uses to digitize books and process PDFs for its search engine. To get text from image or PDF files you first need to upload the files and convert them to Google Docs format. You can then edit the result online and/or download it as PDF, DOC, TXT etc.

To upload files in Google Docs, click the Upload button, select Settings from the menu, check "Convert uploaded files to Google docs format" and "Convert text from uploaded PDF and image files", and then click Upload/Files. Another way is to check "Confirm settings before each upload" after clicking Upload/Settings, so that every time you upload a file you are asked whether you want to convert it or leave it intact. This also gives you the option to select which language dictionary is used in the text recognition process. The file is then converted to a Google Docs document containing both the original image(s) and the converted text. You can review the text and delete the original images afterwards.

Google Docs conversion works quite well, especially with English texts. Over 30 different languages can be selected, but if your language is not included in the list, the conversion may give an error and the file will not be processed. Of course, if you don't have a Google account you can create one at any time.

Google Docs
Input image file types: most bitmap formats
Input PDF files: yes
Output file types: ODT, PDF, TXT, RTF, DOC, HTML
Languages: 30+
PROS: Unlimited processing capacity.
CONS: Text in some minor languages may not be recognized.

The Free Online OCR web page is reviewed more thoroughly on freewaregenius.com.

Free Online OCR
Input image file types: GIF, BMP, JPEG, TIFF, PNG
Input PDF files: yes
Output file types: DOC, PDF, RTF, TXT
Languages: English dictionary only
PROS: No capacity limits for processing. Keeps the original formatting and layout.
CONS: Only an English dictionary is supported. Text in other languages may not be recognized.

i2OCR
Input image file types: TIF, JPEG, PNG, BMP, GIF, PBM, PGM, PPM
Input PDF files: no
Output file types: TXT
Languages: 30+
PROS: No limits for uploading. Has a review option: after character recognition the original image and the result text are shown side-by-side on screen.
CONS: Only text output, so all the original formatting is lost, though at least it supports multi-column pages correctly. Creates hard line breaks at the end of each line. Does not process PDF files.
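The hard line breaks mentioned above can be cleaned up after the fact. The following is a minimal Python sketch, not part of any tool reviewed here, that joins the lines of each paragraph back together. It assumes that paragraphs in the OCR output are separated by blank lines and that a trailing hyphen marks a word split across two lines; the function name is hypothetical.

```python
import re


def unwrap_hard_linebreaks(text: str) -> str:
    """Join the lines inside each paragraph; keep blank lines as paragraph breaks."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                # Assumption: a trailing hyphen means the word was split at the line end.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        unwrapped.append(joined)
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "Optical char-\nacter recognition\nis useful.\n\nSecond para-\ngraph."
    print(unwrap_hard_linebreaks(sample))
```

Note that the hyphen rule is a simplification: a genuinely hyphenated word that happens to break at the end of a line (such as "side-by-side") would lose its hyphen, so the result still needs a quick review.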

OCRonline
Input image file types: JPG, TIFF, PNG, GIF
Input PDF files: yes
Output file types: TXT, PDF, RTF, DOC
Languages: 150+
PROS: Excellent recognition quality. Rebuilds the original formatting. Impressive list of 150 language dictionaries.
CONS: Limited upload capacity: 5 pages per week, file size up to 10 MB. You need to pay to get extra pages.

Online OCR
Input image file types: JPG, JPEG, BMP, TIFF, GIF
Input PDF files: only for registered users
Output file types: DOC, XLS, TXT (+ PDF for registered users)
Languages: 30+
Note: The site has a guest mode and a registered mode. In guest mode, 15 images per hour can be processed and the maximum file size is 4 MB. Registered mode adds some extra possibilities, such as uploading larger images, ZIP archives and multi-page PDFs. Registering gives initial credits for converting 20 pages.
PROS: Supports some languages that other servers do not support.
CONS: Limited upload capacity. Extra capacity may be purchased or earned through the bonus program.

Our Recommendation: The last word on online OCR services


Of the online OCR solutions reviewed above, OCRonline provided good and stable OCR accuracy with a number of different fonts and texts. Unfortunately the free service is limited to 5 pages per week. If you need more capacity, try the other providers, as they may also give good results depending on your source text.

Part 2: Desktop OCR software


Desktop software has to be downloaded and installed on your computer, and it usually has more configurable options than online tools. Some programs can acquire images directly from a scanner, so you don't need another program for that. The following OCR software is reviewed: Cuneiform OpenOCR, FreeOCR, gImageReader, Puma.NET and SimpleOCR.

There are some more free tools available, which are mainly meant for more specific tasks. JOCR is for getting text from screenshots; it requires Microsoft Office 2003 or later to be installed and has been reviewed previously. Nuance PDF Reader can upload scanned PDFs to its online service for character recognition; it has also been reviewed previously. And finally there is MyMorph, a program intended for converting document archive files from one format to another, like TIFF, PDF, RTF etc. MyMorph is able to convert image files to editable text files.

OpenOCR is based on the commercial product Cuneiform, which was released as freeware in 2007.

Cuneiform OpenOCR
License: freeware
Input image: most bitmap file formats
Input PDF: no
Scanner input: yes
Output: TXT, RTF, HTML + output to Word/Excel
Dictionary languages: 20+
PROS: Includes both single-file and batch processing modes.
CONS: The installation program creates invalid Start Menu shortcuts like NewFolder1.

FreeOCR is one of the programs that use the open source Tesseract OCR engine. Tesseract was originally developed by HP and is currently sponsored by Google.

FreeOCR
License: freeware
Requires: Microsoft .NET
Input image: TIFF, multi-page TIFF
Input PDF: yes
Scanner input: yes
Output: TXT
Dictionary languages: 9
PROS: The Tesseract OCR engine has good accuracy.
CONS: Only text output, no formatting recognition. No multi-column support (you must crop the image manually to one column).

gImageReader is one of the front-ends to the free Tesseract OCR engine. You need to download and install Tesseract separately. The Tesseract engine uses OpenOffice dictionaries and spellcheckers, which can be downloaded freely.

gImageReader
License: freeware (GNU)
Requires: Tesseract, which must be downloaded separately
Input image: JPEG, GIF, PNG, TIFF
Input PDF: yes
Scanner input: yes
Output: TXT
Dictionary languages: many, uses freely downloadable OpenOffice spellcheckers
PROS: The Tesseract OCR engine has good accuracy. OCR area(s) can be selected manually.
CONS: Only text output, no formatting recognition.
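Because Tesseract is installed separately, it can also be run on its own from the command line, without a front-end. The sketch below is an illustration only, written in Python, and assumes the `tesseract` executable is installed, on the PATH, and has the requested language data available. The helper names `build_tesseract_command` and `run_ocr` are hypothetical; the command form `tesseract IMAGE OUTPUTBASE -l LANG` is the standard Tesseract invocation, which writes its text result to `OUTPUTBASE.txt`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "eng") -> Path:
    """Run Tesseract on one image and return the path of the text file it writes."""
    # Assumption: Tesseract is installed separately and reachable on the PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base for plain text output.
    return Path(output_base + ".txt")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_tesseract_command("scan.png", "scan", "eng"))
```

As with the front-ends reviewed here, the result is plain text only, so any formatting has to be reapplied by hand.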

Puma.NET is actually not an end-user solution but a development kit based on the CuneiForm OCR engine, though it contains a sample program with a front-end. After installation there is no launch icon in the Start Menu, but you can find the program Puma.Net.Sample.exe deep in the C:\Program Files\Puma.NET\Sample\bin\x86\Debug\ folder.

Puma.NET
License: freeware (BSD)
Requires: Microsoft .NET
Input image: BMP, GIF, EXIF, JPG, PNG and TIFF
Input PDF: no
Scanner input: no
Output: TXT, RTF, HTML
Dictionary languages: 27
PROS: Font and formatting detection.
CONS: You have to create the shortcut to start the program yourself. Leaves hard line breaks.

SimpleOCR uses its own OCR engine that is capable of learning the fonts in a particular document.

SimpleOCR
License: free for all non-commercial purposes
Input image: TIFF, JPG, BMP
Input PDF: no
Scanner input: yes
Output: DOC, TXT
Dictionary languages: 3
Note: SimpleOCR seems to give better results from color JPEGs than from grayscale ones.
PROS: Word-by-word text revision. Ability to train the engine to use specific fonts. Includes both single-file and batch processing modes.
CONS: Dictionaries for only 3 languages. No font and format detection.

Our Recommendation: The last word on desktop OCR software

From the desktop OCR software reviewed above provided good accuracy with different fonts including artistic. Having said that, most of the programs also performed well when processing text with simple fonts.

About OCR and how to get better results


OCR is used to turn printed books and documents back into text. OCR tools analyze the image, recognize the characters and words, and output them as an editable text file. Character recognition is never perfect. According to some studies, the accuracy of commercial OCR products varies from 70% to 98%, and total accuracy can be achieved only with the help of human review. To improve accuracy, most OCR tools also use dictionaries: instead of recognizing individual characters they try to recognize whole words that exist in the selected dictionary.

Some OCR software cannot detect fonts and formatting and can only give plain text as output. You then need to reapply all the formatting manually. Some OCR engines, however, detect font styles like bold and italic, and some also detect paragraph formatting, multiple columns, tables and images inside the text, so they can use this information to replicate the document in an editable format like DOC, HTML etc.

The source for character recognition can be an image obtained from a scanner, a digital camera or a screenshot. If you use a scanner and you have a lot of pages, you might use OCR software that has scanner support built in. The program then suggests the settings that give the best results for OCR. Usually this means 300 dpi resolution (200 dpi minimum) and a grayscale JPG or TIFF image. Some software handles color images better than grayscale, though. So if you do not get good results, try several settings, like 300 dpi color JPEG and 300 dpi grayscale JPG, or TIFF instead of JPG.

Getting decent OCR results from images taken with a digital camera is quite difficult. Good light, no flash, straight paper, macro mode etc. help to get better results.

It is also possible to get text from screenshot files, but this needs some extra measures. Usually the resolution of a screenshot is 72 dpi, but OCR needs at least 200 dpi. Some OCR programs can automatically adjust the resolution of the image file, but for others you need to use an image manipulation program to convert the resolution to 200 dpi. For screenshots you can also use special programs like JOCR.

OCR is often used to process PDF files. A PDF usually consists of images that are shown on screen and also the source text that you can select for copy-paste. But some PDFs contain only images, like scanned PDF files. The usual convert-PDF-to-Word type of software cannot process these files. To extract text from PDF files that contain only images you need to use OCR software that accepts PDF files as input.
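The screenshot resolution conversion described above is simple arithmetic: to go from 72 dpi to 200 dpi, each pixel dimension has to be multiplied by 200/72. The following Python sketch only computes the target size; the function name is hypothetical, and the actual resizing would be done in whatever image manipulation program you use.

```python
import math
from typing import Tuple


def upscaled_size(width_px: int, height_px: int,
                  source_dpi: int = 72, target_dpi: int = 200) -> Tuple[int, int]:
    """Return the pixel size an image needs to reach target_dpi at the same physical size."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    # Multiply before dividing so exact cases stay exact; round up to whole pixels.
    new_width = math.ceil(width_px * target_dpi / source_dpi)
    new_height = math.ceil(height_px * target_dpi / source_dpi)
    return new_width, new_height


if __name__ == "__main__":
    print(upscaled_size(720, 360))
```

For example, a 720 x 360 pixel screenshot at 72 dpi would be enlarged to 2000 x 1000 pixels. Keep in mind that enlarging an image does not add real detail, so results may still be worse than with a true 200 dpi scan.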

About Priit: Priit Lilleleht has written 4 posts for this blog.
