Extracting Text From Images
Printing text to paper is done every day; on some occasions, however, the reverse is needed: getting the original text back from a scanned image or photograph for further editing and use. This conversion is called Optical Character Recognition, or OCR for short. It can convert scanned books and documents into editable text, extract editable text from PDFs created by scanning, or even pull text from screenshots and images. A variety of character recognition tools are available, and some of them are free to use. This article will help you find and choose between several free OCR tools.
With desktop software you don't need to worry about uploading sensitive information to foreign servers, or whether your file will take too long to upload. Desktop programs generally give better text review options, and some offer integration with scanner software.
A note on comparing OCR software: OCR programs are not mainstream applications, so there is only a limited number of freeware titles available, unlike, for example, media players or file managers. In this article we aimed to provide a complete list of the titles we found and evaluated at the time of writing. OCR results tend to vary: the accuracy of different OCR solutions depends on the quality, file format and fonts of the source documents. For instance, some programs give better results with typewriter fonts and worse results with screen fonts, whereas other programs perform exactly the opposite. We therefore shied away from a head-to-head comparison of OCR accuracy in this article, as a rating could be unfair for the specific files you might need to process. There is also some general information about getting good OCR results. We reviewed the following online OCR services and desktop OCR programs, all of which are either free or have a free component.
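Because results depend so much on your own documents, one way to compare tools is to score each tool's output against a short passage you have typed out by hand. The sketch below is only an illustration: it assumes Python's standard difflib module as a rough similarity measure, and the function names and sample strings are hypothetical, not part of any of the tools reviewed here.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all whitespace runs so line wrapping does not affect the score."""
    return " ".join(text.split())


def similarity(reference: str, ocr_output: str) -> float:
    """Return a 0..1 similarity ratio between a hand-checked reference and OCR output."""
    a, b = normalize(reference), normalize(ocr_output)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    reference = "The quick brown fox"
    # Hypothetical outputs from two tools, typed in by hand for illustration.
    results = {"tool_a": "The quick brown fox", "tool_b": "The qu1ck brovvn fox"}
    for name, text in results.items():
        print(f"{name}: {similarity(reference, text):.2f}")
```

A ratio like this is a rough guide rather than a formal accuracy metric, but it is enough to see which tool copes better with a particular font or scan quality.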
Google Docs conversion works quite well, especially with English texts. More than 30 languages can be selected, but if your language is not included in the list, the conversion may give an error and the file will not be processed. Of course, if you don't have a Google account you can create one at any time.
Input image file types: most bitmap formats
Input PDF files: yes
Output file types: ODT, PDF, TXT, RTF, DOC, HTML
Languages: 30+
Google Docs / PROS: Unlimited processing capacity.
CONS: Text in some minor languages may not be recognized.
The Free Online OCR web page is reviewed more thoroughly on freewaregenius.com.
Input image file types: GIF, BMP, JPEG, TIFF, PNG
Input PDF files: yes
Output file types: DOC, PDF, RTF, TXT
Languages: English dictionary only
Free Online OCR / PROS: No capacity limits for processing; keeps the original formatting and layout.
CONS: Only an English dictionary is supported. Text in other languages may not be recognized.
Input image file types: TIF, JPEG, PNG, BMP, GIF, PBM, PGM, PPM
Input PDF files: no
Output file types: TXT
Languages: 30+
i2OCR / PROS: No limits for uploading; has a review option after character recognition, in which the original image and the result text are shown side-by-side on screen.
CONS: Only text output, so all the original formatting will be lost, though at least it supports multi-column pages correctly. Creates hard line breaks at the end of each line. Does not process PDF files.
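Hard line breaks in plain text output can be removed after the fact. The following is a minimal sketch, not a feature of any of the services above: it assumes that blank lines mark paragraph boundaries and that a trailing hyphen followed by a lowercase letter is a word split across two lines. The function name is made up for this example.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Assumption: a trailing hyphen followed by a lowercase letter
                # is a word split across lines, so drop the hyphen and join.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        cleaned.append(joined)
    return "\n\n".join(cleaned)
```

The hyphen rule is a heuristic and will occasionally merge a genuinely hyphenated word, so the result is still worth a quick read-through.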
Input image file types: JPG, TIFF, PNG, GIF
Input PDF files: yes
Output file types: TXT, PDF, RTF, DOC
Languages: 150+
OCRonline / PROS: Excellent recognition quality; rebuilds the original formatting; impressive list of 150 language dictionaries.
CONS: Limited upload capacity: 5 pages per week, file size up to 10 MB. You need to pay to get extra pages.
Input image file types: JPG, JPEG, BMP, TIFF, GIF
Input PDF files: only for registered users
Output file types: DOC, XLS, TXT (+ PDF for registered users)
Languages: 30+
Note: This site has a registered mode and a guest mode. In guest mode, 15 images per hour can be processed and the maximum file size is 4 MB. Registered mode adds some extra possibilities, such as uploading larger images, ZIP archives and multi-page PDFs. The initial credit after registering covers converting 20 pages.
Online OCR / PROS: Supports some languages that other services do not support.
CONS: Limited upload capacity. Extra capacity may be purchased or earned through a bonus program.
pages per week. If you need more capacity, try the other providers as they also may give good results depending on your source text.
This is another of the programs that use the open source Tesseract OCR engine. Tesseract was originally developed by HP and is currently sponsored by Google.
License: freeware
Requires: Microsoft .NET
Input image: TIFF, multi-page TIFF
Input PDF: yes
Scanner input: yes
Output: TXT
Dictionary languages: 9
FreeOCR / PROS: The Tesseract OCR engine has good accuracy.
CONS: Only text output, no formatting recognition; no multi-column support (you must crop the image manually to one column).
gImageReader is one of the front-ends to the free Tesseract OCR engine. You need to download and install Tesseract separately. The Tesseract engine uses OpenOffice dictionaries and spellcheckers, which can also be downloaded separately.
License: freeware (GNU)
Requires: Tesseract, which must be downloaded separately
Input PDF: yes
Dictionary languages: many, uses freely downloadable OpenOffice spellcheckers
Scanner input: yes
Input image: JPEG, GIF, PNG, TIFF
Output: TXT
gImageReader / PROS: The Tesseract OCR engine has good accuracy; OCR area(s) can be manually selected.
CONS: Only text output, no formatting recognition.
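Since front-ends of this kind sit on top of a separately installed Tesseract, the engine can also be driven directly from its command line, which takes an image, an output base name and an optional language code. The sketch below wraps that call in Python as an illustration only: it assumes the tesseract executable is on your PATH, and the helper names and any file names you pass in are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image: str, output_base: str, lang: str = "eng") -> list[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image, output_base, "-l", lang]


def ocr_to_text(image: str, output_base: str, lang: str = "eng") -> str:
    """Run Tesseract and return the text it writes to OUTPUTBASE.txt."""
    if shutil.which("tesseract") is None:
        # Assumption: Tesseract was installed separately and added to PATH.
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_command(image, output_base, lang), check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_to_text("scan.tif", "scan")` would, under these assumptions, leave a `scan.txt` file next to your script and return its contents; the matching language data for the chosen code must be installed for Tesseract to accept it.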
Puma.NET is actually not an end-user solution but a development kit based on the CuneiForm OCR engine, though it contains a sample program with a front-end. After installation there will be no launch icon in the Start Menu, but you can find the program Puma.Net.Sample.exe deep in the C:\Program Files\Puma.NET\Sample\bin\x86\Debug\ folder.
License: freeware (BSD)
Requires: Microsoft .NET
Input image: BMP, GIF, EXIF, JPG, PNG and TIFF
Input PDF: no
Scanner input: no
Output: TXT, RTF, HTML
Dictionary languages: 27
Puma.NET / PROS: Font and formatting detection.
CONS: You have to create the shortcut to start the program yourself; leaves hard line breaks.
SimpleOCR uses its own OCR engine, which is capable of learning the fonts in a particular document.
License: free for all non-commercial purposes
Input image: TIFF, JPG, BMP
Input PDF: no
Scanner input: yes
Output: DOC, TXT
Dictionary languages: 3
Note: SimpleOCR seems to give better results from color JPEGs than from grayscale ones.
SimpleOCR / PROS: Word-by-word text revision; ability to train the engine to use specific fonts; includes both single-file and batch processing modes.
CONS: Dictionaries for only 3 languages; no font and format detection.
From the desktop OCR software reviewed above provided good accuracy with different fonts including artistic. Having said that, most of the programs also performed well when processing text in simple fonts.
About Priit: Priit Lilleleht has written 4 posts for this blog.