Thursday, March 28, 2013

Ubuntu OCR Solution.

Since my transfer to the company's NOIDA office, my work has been mostly office work, apart from an occasional visit to a power station.

I was looking for an OCR solution to convert scanned PDF documents to text files. Initially I tried the pdfocr and tesseract command-line tools, but without much success.
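For anyone who wants to try the command-line route anyway, here is a minimal sketch of it in Python. It assumes pdftoppm (from poppler-utils) and tesseract are installed. The file names, the pages folder and the helper function names are only placeholders I made up for illustration, and this is not necessarily what pdfocr does internally.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_cmd(image_path, lang="eng"):
    """Build the tesseract command; tesseract appends .txt to the output base."""
    image_path = Path(image_path)
    out_base = image_path.with_suffix("")  # pages/page-1.png -> pages/page-1
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, workdir="pages"):
    """Rasterize a scanned PDF, then OCR every page image in order."""
    workdir = Path(workdir)
    workdir.mkdir(exist_ok=True)
    subprocess.run(rasterize_cmd(pdf_path, workdir / "page"), check=True)
    # pdftoppm names its output page-1.png, page-2.png, ... (zero padded)
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_cmd(image), check=True)


if __name__ == "__main__":
    ocr_pdf("scan.pdf")  # hypothetical input file
```

Each page ends up as a separate .txt file next to its image. The 300 dpi default is my assumption; scans rendered at a lower resolution tend to give worse OCR results.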

Then I converted one page online at ABBYY FineReader. The site allowed me to convert 3 pages for free; after that I would have to pay.

I then discovered this page about a Linux OCR solution. I downloaded the .deb file and installed it on Ubuntu 12.04. It installed without any dependency problems, since I already had tesseract installed.

The program, Lios, is actually a GUI that uses the cuneiform or tesseract engine in the background. I had already tried pdfocr, which drives cuneiform and tesseract from the command line, so I was not hoping for good results, but Lios worked much better.

I used the cuneiform engine for normal scanned pages and the tesseract engine when there was a table on the page. Pages with a table take longer, but tesseract extracts the text correctly.
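The same choice of engine can be sketched outside the GUI as well. The snippet below is only an illustration of the rule I followed, not how Lios works internally. The function name and the has_table flag are my own invention, and the flag has to be set by hand after looking at the page. It assumes the cuneiform and tesseract command-line tools are installed.

```python
import subprocess


def ocr_command(image_path, out_base, has_table=False, lang="eng"):
    """Pick the OCR engine for one page image.

    Pages with a table go to tesseract, everything else to cuneiform.
    Either way the text ends up in out_base + ".txt".
    """
    if has_table:
        # tesseract adds the .txt extension to the output base itself
        return ["tesseract", image_path, out_base, "-l", lang]
    # cuneiform needs the full output file name and an explicit format
    return ["cuneiform", "-l", lang, "-f", "text",
            "-o", out_base + ".txt", image_path]


if __name__ == "__main__":
    # hypothetical pages: the second one contains a table
    pages = [("page-1.png", False), ("page-2.png", True)]
    for image, has_table in pages:
        out_base = image[:-4]  # strip ".png"
        subprocess.run(ocr_command(image, out_base, has_table), check=True)
```

Keeping the command building separate from running it makes it easy to print the command first and check it before letting it loose on a large batch of pages.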