All of the interesting technological, artistic or just plain fun subjects I'd investigate if I had an infinite number of lifetimes. In other words, a dumping ground...

Wednesday 26 September 2007

OCR (optical character recognition)

A lot of the hassle can be avoided by using gscan2pdf's OCR feature with
tesseract.
http://gscan2pdf.sourceforge.net/


gscan2pdf does all the image conversion in the background, and it will also
use whatever language data files are available to tesseract. I've tested
it with texts in English and French, and the OCR results are surprisingly
good, far better than anything I have gotten from gocr. I use lineart scans
at 300 or 600 dpi. It also helps to clean up the scanned pages with unpaper
before running OCR, another task made easy by gscan2pdf.
http://unpaper.berlios.de/
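
For anyone curious what the frontend is saving you from, here is a rough sketch of the same clean-then-OCR sequence done by hand from Python. It is not how gscan2pdf works internally; it only assumes the basic command-line forms `unpaper <input> <output>` and `tesseract <image> <output base> -l <language>`, and that the scan is already in a format unpaper accepts (traditionally PNM). The function names and file layout are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def unpaper_command(scan: Path, cleaned: Path) -> list[str]:
    """Command line asking unpaper to clean up one scanned page."""
    return ["unpaper", str(scan), str(cleaned)]


def tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Command line asking tesseract to write its text to <output_base>.txt."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(scan: Path, workdir: Path, lang: str = "eng") -> str:
    """Clean a scan with unpaper, OCR it with tesseract, return the text.

    Hypothetical helper: the file naming scheme here is an assumption.
    """
    for tool in ("unpaper", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")

    cleaned = workdir / f"{scan.stem}-clean{scan.suffix}"
    output_base = workdir / scan.stem

    # check=True raises CalledProcessError if either tool exits with an error.
    subprocess.run(unpaper_command(scan, cleaned), check=True)
    subprocess.run(tesseract_command(cleaned, output_base, lang), check=True)

    # tesseract appends ".txt" to the output base name on its own.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Something like `ocr_page(Path("page1.pnm"), Path("/tmp"), lang="fra")` would then return the recognized French text, provided the French language data is installed for tesseract.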


Also, while the OCR output can be copied and pasted elsewhere, gscan2pdf
will automatically attach it as an annotation when generating a PDF. This
means that PDFs containing images of scanned paper can be indexed and
located with desktop search.
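
To make the point about indexing concrete, here is a toy sketch of what having the OCR text available makes possible. It is only an illustration under my own assumptions: real desktop search tools are far more sophisticated, and the file names and sample texts are invented. The idea is simply that once each scanned PDF carries its recognized text, a word index can map search terms back to files.

```python
import re
from collections import defaultdict


def build_index(ocr_text_by_file: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of files whose OCR text contains it."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for filename, text in ocr_text_by_file.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(filename)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the files that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Invented sample data standing in for text extracted from two scanned PDFs.
scanned = {
    "invoice.pdf": "Facture du 3 mai, total 120 euros",
    "letter.pdf": "Dear Sir, the total is due in May",
}
index = build_index(scanned)
print(search(index, "total"))          # both files contain "total"
print(search(index, "facture total"))  # only invoice.pdf has both words
```

Without the attached OCR text, a scanned PDF is just a picture of a page, and there would be nothing to feed into an index like this.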


In addition to serving as a frontend for tesseract and unpaper, gscan2pdf
has other features that make it a useful all-around utility for scanning and
for handling scanned pages: ADF support, PDF import, export to TIFF and DjVu,
thumbnails for easy page reordering, rotation, and so on. I'm just a very
happy user of this application, which I think deserves more attention than it
gets.
