Acceptance Criteria:
- The new Wikimedia OCR should accept Tesseract options through the API, such as multiple languages, PSM (page segmentation mode), and engine mode (OEM).
The multiple languages part of this will be dealt with in T280214 (because the lang list is common to both engines). We might want to still do some per-engine verification of the language codes though.
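As a rough illustration of what per-engine verification could look like, here is a minimal Python sketch. It is not the tool's actual implementation: the function name `validate_options`, the engine keys, and the language sets are all hypothetical placeholders. The only facts assumed about Tesseract are the numeric ranges listed in its command-line help (PSM 0 to 13, OEM 0 to 3).

```python
# Ranges as listed by Tesseract's CLI help: PSM 0-13, OEM 0-3.
VALID_PSM = range(0, 14)
VALID_OEM = range(0, 4)

# Hypothetical per-engine language lists, for illustration only.
# The real tool would load these from each engine.
ENGINE_LANGS = {
    "tesseract": {"eng", "fra", "deu"},
    "other_engine": {"en", "fr", "de"},
}


def validate_options(engine, langs, psm=None, oem=None):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    known_langs = ENGINE_LANGS.get(engine)
    if known_langs is None:
        errors.append(f"unknown engine: {engine}")
    else:
        # Check every requested language code against this engine's list.
        for lang in langs:
            if lang not in known_langs:
                errors.append(f"language {lang!r} not supported by {engine}")
    # PSM and OEM are optional; only check them when supplied.
    if psm is not None and psm not in VALID_PSM:
        errors.append(f"psm must be between 0 and 13, got {psm}")
    if oem is not None and oem not in VALID_OEM:
        errors.append(f"oem must be between 0 and 3, got {oem}")
    return errors
```

Returning a list of problems rather than raising on the first one would let the API report every invalid option in a single response.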
PR merged: https://github.com/wikimedia/wikimedia-ocr/pull/22
Note that, depending on which options you choose, you might get errors about an invalid DPI (dots per inch). In production/staging this will display as a 500 error page. We're not really sure which conditions require you to set the DPI, and in my testing I would still sometimes get the same error even when I did set it, so we're omitting a DPI option for the time being. See the discussion at https://github.com/wikimedia/wikimedia-ocr/pull/22#discussion_r625531031
I've tested some combinations of the Tesseract options via the UI. I see variations in the returned OCR text, so I guess that means the options are being passed to Tesseract.
For example, compare:
Test environment: https://ocr-test.wmcloud.org Version 0.2.0
Testing on the above and switching the Tesseract PSM options yielded 500 errors.
https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F1%2F1e%2FThe_Book_of_Scottish_Song.djvu%2Fpage20-1024px-The_Book_of_Scottish_Song.djvu.jpg&engine=tesseract&psm=3&oem=3
https://ocr-test.wmcloud.org/?image=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F1%2F1e%2FThe_Book_of_Scottish_Song.djvu%2Fpage20-1024px-The_Book_of_Scottish_Song.djvu.jpg&engine=tesseract&psm=12&oem=3
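
For anyone repeating this comparison, the two URLs differ only in `psm`, so they can be generated rather than hand-edited. The Python sketch below uses the query parameters visible in the links above (`image`, `engine`, `psm`, `oem`). The helper name `build_ocr_url` is made up for illustration, and the base URL and image URL assume the test host and Commons thumbnail are addressed directly.

```python
from urllib.parse import urlencode

# Assumed base URL of the test instance mentioned in this thread.
BASE_URL = "https://ocr-test.wmcloud.org/"

# Commons thumbnail used in the comparison.
IMAGE = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/"
    "The_Book_of_Scottish_Song.djvu/"
    "page20-1024px-The_Book_of_Scottish_Song.djvu.jpg"
)


def build_ocr_url(image_url, psm, oem=3, engine="tesseract"):
    """Build a UI URL with the given Tesseract options (hypothetical helper)."""
    # urlencode percent-encodes the image URL so it survives as one parameter.
    query = urlencode(
        {"image": image_url, "engine": engine, "psm": psm, "oem": oem}
    )
    return f"{BASE_URL}?{query}"


# Same image and engine mode, two page segmentation modes to compare.
urls = [build_ocr_url(IMAGE, psm) for psm in (3, 12)]
for url in urls:
    print(url)
```

Opening the generated URLs side by side makes it easy to see whether changing `psm` alters the returned text or triggers the 500 error.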