Chinese OCR on Ubuntu Linux

Convert scanned images of Chinese documents to real, searchable, editable text.

There is some information about OCR options on Ubuntu/Linux, but it doesn’t explain the setup for Chinese text very well.

OCRFeeder can be installed from the Ubuntu Software Center (Applications > Ubuntu Software Center, then click on Office). OCRFeeder works as a graphical front end for OCR engines such as Tesseract, which do the actual optical character recognition. Tesseract provides language-specific data files on its downloads page. For Chinese, these are chi_tra.traineddata.gz for traditional Chinese and chi_sim.traineddata.gz for simplified Chinese.

  1. Download the files and gunzip them.
  2. Move them to the tessdata directory. For me the path is /usr/local/share/tessdata/.
  3. Start OCRFeeder.
  4. Open the OCR Engines dialog (Tools > OCR Engines).
  5. Click “Add”, and fill in the fields as follows:
    • Name: Tesseract – Traditional Chinese
    • Image format: TIFF
    • Failure string: (leave blank)
    • Engine path: /usr/local/bin/tesseract (or whatever the path is for your tesseract installation)
    • Engine arguments: $IMAGE $FILE -l chi_tra; cat $FILE.txt; rm $FILE
  6. That was for traditional Chinese. For simplified Chinese, add another engine. The following fields will be different:
    • Name: Tesseract – Simplified Chinese
    • Engine arguments: $IMAGE $FILE -l chi_sim; cat $FILE.txt; rm $FILE
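
Steps 1 and 2 above can be sketched as a small shell function. This is only an illustration: the function name install_traineddata is made up for this example, and it assumes the two .gz files have already been downloaded into one directory and that you can write to the tessdata directory.

```shell
#!/bin/sh
# Sketch of steps 1 and 2: unpack the Chinese language files and move
# them into the tessdata directory.
# install_traineddata is a hypothetical helper name, not part of Tesseract.

install_traineddata() {
    # $1 = directory holding the downloaded *.traineddata.gz files
    # $2 = tessdata directory of your Tesseract installation
    src_dir=$1
    tessdata_dir=$2
    for lang in chi_tra chi_sim; do
        gz="$src_dir/$lang.traineddata.gz"
        if [ -f "$gz" ]; then
            gunzip -f "$gz"    # leaves $lang.traineddata next to it
        fi
        if [ -f "$src_dir/$lang.traineddata" ]; then
            mv "$src_dir/$lang.traineddata" "$tessdata_dir/"
        else
            echo "missing: $lang.traineddata" >&2
        fi
    done
}
```

With the path from step 2, a call might look like: install_traineddata ~/Downloads /usr/local/share/tessdata (run it as a user with write access to that directory, and substitute your own tessdata path if it differs).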

It should now be possible to select either form of Chinese when performing OCR.
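
If recognition fails inside OCRFeeder, it can help to run roughly the same commands by hand. The sketch below assumes OCRFeeder substitutes the image path for $IMAGE and a temporary file name for $FILE; the function name ocr_chinese and the TESSERACT_BIN variable are invented for this example. It relies only on Tesseract's basic command form (image, output base name, -l language), which writes the recognized text to the output base name plus .txt.

```shell
#!/bin/sh
# Rough command-line equivalent of the engine arguments above.
# ocr_chinese and TESSERACT_BIN are hypothetical names for this sketch;
# TESSERACT_BIN defaults to "tesseract" found on the PATH.

ocr_chinese() {
    # $1 = image file (e.g. a TIFF), $2 = language code: chi_tra or chi_sim
    image=$1
    lang=$2
    out_base=$(mktemp) || return 1    # plays the role of $FILE
    # Tesseract writes the recognized text to "$out_base.txt".
    if "${TESSERACT_BIN:-tesseract}" "$image" "$out_base" -l "$lang"; then
        cat "$out_base.txt"
        status=0
    else
        echo "tesseract failed for language $lang" >&2
        status=1
    fi
    rm -f "$out_base" "$out_base.txt"
    return "$status"
}
```

For example, ocr_chinese page.tif chi_tra prints the recognized text for a scan named page.tif (a placeholder file name). Any error messages from Tesseract itself appear directly in the terminal.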

2 thoughts on “Chinese OCR on Ubuntu Linux”

  1. Marijn

    Thanks for posting this. However, when I follow it, ocrfeeder fails to recognize characters and spits out the following to the console (repeatedly):

    Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/chi_tra.unicharset

    cat: /tmp/tmpGo9cLS.txt: No such file or directory

    I’ve been unable to find a file named chi_tra.unicharset. Did the interface to tesseract change, or am I missing something trivial?

    Thanks,
    Marijn

  2. Aharon

    In my case I had trouble adding OCR engines to OCRFeeder.
    After installing leptonica-progs with
    “sudo apt-get install leptonica-progs”
    everything works.
