Convert scanned images of Chinese documents to real, searchable, editable text.
There is some information about OCR options on Ubuntu/Linux, but it doesn't explain the setup for Chinese text very well. OCRFeeder can be installed from the Ubuntu Software Center (Applications > Ubuntu Software Center, then click on Office). OCRFeeder works as a graphical front end for OCR engines like Tesseract, which do the actual optical character recognition. Tesseract provides language-specific data files on its downloads page. For Chinese, these are
chi_tra.traineddata.gz and chi_sim.traineddata.gz, for traditional and simplified Chinese respectively.
- Download the files and gunzip them.
- Move them to the tessdata directory. For me the path is
- Start OCRFeeder.
- Open the OCR Engines dialog (Tools > OCR Engines).
- Click “Add”, and fill in the fields as follows:
  - Name: Tesseract – Traditional Chinese
  - Image format: TIFF
  - Failure string: (leave blank)
  - Engine path: /usr/local/bin/tesseract (or whatever the path is for your Tesseract installation)
  - Engine arguments: $IMAGE $FILE -l chi_tra; cat $FILE.txt; rm $FILE
- That was for traditional Chinese. For simplified Chinese, add another engine. The following fields will be different:
  - Name: Tesseract – Simplified Chinese
  - Engine arguments: $IMAGE $FILE -l chi_sim; cat $FILE.txt; rm $FILE
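The download, gunzip, and move steps can also be scripted. The following is only a rough sketch in Python, not part of OCRFeeder or Tesseract: the function name install_traineddata is made up for illustration, and the tessdata location is an assumption you need to replace with the directory used by your own installation.

```python
import gzip
import shutil
from pathlib import Path


def install_traineddata(gz_path, tessdata_dir):
    """Decompress a downloaded .traineddata.gz file into tessdata_dir.

    Hypothetical helper for illustration; returns the path of the new file.
    """
    gz_path = Path(gz_path)
    tessdata_dir = Path(tessdata_dir)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    # Dropping the final suffix turns chi_tra.traineddata.gz
    # into chi_tra.traineddata.
    target = tessdata_dir / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Example call (the directory is a placeholder, not a real default):
# install_traineddata("chi_tra.traineddata.gz", "/path/to/tessdata")
```

Writing into a system-wide tessdata directory usually needs root permissions, so the script may have to be run with sudo.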
It should now be possible to select either form of Chinese when performing OCR.
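If OCRFeeder returns nothing, it can help to run Tesseract by hand with the same arguments to check that the language data is found. The sketch below mirrors the engine arguments above (image, output base name, -l language) in Python. The function names build_tesseract_command and ocr_image are invented for this example, and it assumes a tesseract binary is on your PATH, or that you pass its full path as engine.

```python
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image, output_base, lang, engine="tesseract"):
    """Return the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG."""
    return [engine, str(image), str(output_base), "-l", lang]


def ocr_image(image, lang, engine="tesseract"):
    """Run Tesseract on one image and return the recognised text."""
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "out"
        command = build_tesseract_command(image, output_base, lang, engine)
        # check=True raises CalledProcessError if Tesseract exits with an
        # error, for example when the language data cannot be loaded.
        subprocess.run(command, check=True)
        # Tesseract appends .txt to the output base name.
        return (Path(tmp) / "out.txt").read_text(encoding="utf-8")


# Example call (scan.tif is a placeholder file name):
# print(ocr_image("scan.tif", "chi_tra"))
```

The temporary directory plays the role of OCRFeeder's $FILE: the text is read back and the output file is then discarded.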