OPTICAL CHARACTER RECOGNITION FOR SINHALA

Optical Character Recognition (OCR) for printed Sinhala documents, based on Tesseract 3.01 and trained by the Software Development Unit of the University of Colombo School of Computing (UCSC).

The OCR process is divided into two stages: input and processing, handled by the Tesseract OCR engine, and post-processing, handled by an engine developed at UCSC.
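
The two-stage flow can be sketched roughly as follows. This is only an illustration under stated assumptions, not the UCSC implementation: it assumes the tesseract command-line tool is installed with Sinhala trained data available under the language code "sin", and the post_process function (Unicode normalization plus whitespace cleanup) is a hypothetical stand-in, since the actual UCSC post-processing engine is not described here.

```python
import subprocess
import tempfile
import unicodedata
from pathlib import Path


def recognize(image_path, lang="sin"):
    """Stage 1: run the tesseract CLI on an image and return the raw text.

    Assumes the `tesseract` executable is on PATH and that trained data
    for `lang` is installed ("sin" is assumed here for Sinhala).
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / "out"
        # tesseract writes its result to <out_base>.txt
        subprocess.run(
            ["tesseract", str(image_path), str(out_base), "-l", lang],
            check=True,
        )
        return out_base.with_suffix(".txt").read_text(encoding="utf-8")


def post_process(raw_text):
    """Stage 2 (hypothetical stand-in): tidy the raw recognizer output.

    Normalizes to Unicode NFC, collapses runs of whitespace inside each
    line, and drops empty lines. The real UCSC engine may do far more.
    """
    text = unicodedata.normalize("NFC", raw_text)
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def ocr(image_path, recognizer=recognize):
    """Run both stages; `recognizer` can be swapped out for testing."""
    return post_process(recognizer(image_path))
```

Keeping the recognizer as a parameter means the post-processing stage can be exercised on sample text without Tesseract installed.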

Please note that files you load are retained, together with their output, for future development of the OCR.

At present, the OCR supports only single-column, printed documents supplied as JPEG files. More image file formats and features will follow.
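
A client could check the file format before submitting an image. The sketch below is a minimal example and not part of the UCSC service: the function names are illustrative, and it only tests for the JPEG start-of-image signature (bytes FF D8 FF). It cannot tell whether a page is single-column or printed.

```python
# JPEG files begin with the start-of-image marker FF D8, followed by FF,
# which opens the next marker segment.
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the byte string starts with the JPEG signature."""
    return data[:3] == JPEG_SIGNATURE


def is_jpeg_file(path) -> bool:
    """Read the first three bytes of a file and check the signature."""
    with open(path, "rb") as f:
        return looks_like_jpeg(f.read(3))
```

Checking the leading bytes is more reliable than trusting a .jpg or .jpeg file extension, which may be wrong.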

A standalone OCR will be available soon, together with the Document Management System developed by UCSC.

Click here to visit the OCR for Sinhala website