The version of FontForge currently available from the Ubuntu repositories is unusable with Ubuntu 11.10. The bug and the solution are described here. It seems that FontForge has been fixed to work with 11.10, but the Ubuntu repositories have not been updated. Version 20110225 works; the older version 20100429 in the repository doesn’t.
Aleksandr’s solution works:
- Uninstall the old FontForge package (using Synaptic or whatever).
- Install git and python-dev if they are not already installed.
- Download and compile FontForge from source:
Open a terminal screen, and enter the following lines:
git clone git://fontforge.git.sourceforge.net/gitroot/fontforge/fontforge
That downloads the source code for the latest version.
cd fontforge
That moves you into the directory where the FontForge files have just landed.
sudo make install
That compiles FontForge. The executable was at ~/fontforge/fontforge/fontforge. Change the permissions to run it without
- Place all JPEG files to be converted into a single directory.
convert *.jpg my_pdf_file.pdf
The pages follow the sorted order of the original file names. Conversion can be slow and resource-intensive when the number of JPEG files is large (around 100). The resulting PDF file is typically about the same size as the original files combined.
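The conversion step can be wrapped in a small shell function. This is only a minimal sketch, assuming ImageMagick's `convert` is on the PATH; the function name `jpegs_to_pdf` and its two arguments are hypothetical, chosen for illustration.

```shell
# Sketch: combine every .jpg file in a directory into a single PDF.
# Assumes ImageMagick's convert is installed; jpegs_to_pdf is a hypothetical name.
jpegs_to_pdf() {
    src_dir=$1
    out_pdf=$2
    # The shell expands *.jpg in sorted order, and that order becomes the page order.
    set -- "$src_dir"/*.jpg
    # If nothing matched, the pattern is left unexpanded and names no real file.
    if [ ! -e "$1" ]; then
        echo "no .jpg files found in $src_dir" >&2
        return 1
    fi
    convert "$@" "$out_pdf"
}
```

It could be called as `jpegs_to_pdf ~/scans my_pdf_file.pdf`, where `~/scans` is a placeholder directory. Because the shell sorts names as text, `page10.jpg` comes before `page2.jpg`; zero-padded names such as `page02.jpg` keep the pages in numeric order.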
Convert scanned images of Chinese documents to real, searchable, editable text.
There is some information on OCR options for Ubuntu/Linux, but it doesn’t explain the setup for Chinese text very well. OCRFeeder can be installed from the Ubuntu Software Center (Applications > Ubuntu Software Center – click on Office). OCRFeeder works as a graphical front end for OCR engines like Tesseract that do the actual optical character recognition. Tesseract provides files for language-specific OCR on its downloads page. For Chinese, these are chi_tra.traineddata.gz and chi_sim.traineddata.gz for traditional and simplified Chinese respectively.
- Download the files and gunzip them.
- Move them to the tessdata directory. For me the path is
- Start OCRFeeder.
- Open the OCR Engines dialog (Tools > OCR Engines).
- Click “Add”, and fill in the fields as follows:
- Name: Tesseract – Traditional Chinese
- Image format: TIFF
- Failure string: (leave blank)
- Engine path: /usr/local/bin/tesseract (or whatever the path is for your tesseract installation)
- Engine arguments: $IMAGE $FILE -l chi_tra; cat $FILE.txt; rm $FILE
- That was for traditional Chinese. For simplified Chinese, add another engine. The following fields will be different:
- Name: Tesseract – Simplified Chinese
- Engine arguments: $IMAGE $FILE -l chi_sim; cat $FILE.txt; rm $FILE
It should now be possible to select either form of Chinese when performing OCR.
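The setup can also be checked from a terminal, outside OCRFeeder. The following is a minimal sketch, assuming `tesseract` is on the PATH and accepts the `image outputbase -l lang` form used in the engine arguments above. The function names are hypothetical, and the tessdata directory is passed as an argument because its location depends on the installation.

```shell
# Sketch: install a downloaded language file and run Tesseract by hand.
# Function names are hypothetical; the tessdata path varies by installation.

# Unpack a .traineddata.gz file into the given tessdata directory.
install_traineddata() {
    gz_file=$1
    tessdata_dir=$2
    # Strip the directory and the .gz suffix to get the target file name.
    target=$tessdata_dir/$(basename "$gz_file" .gz)
    gunzip -c "$gz_file" > "$target"
}

# Recognize one image and print the text, mirroring the engine arguments:
# tesseract writes <outputbase>.txt, which is printed and then removed.
ocr_chinese() {
    image=$1
    lang=${2:-chi_tra}   # chi_tra = traditional, chi_sim = simplified
    out_base=$(mktemp) || return 1
    tesseract "$image" "$out_base" -l "$lang" || { rm -f "$out_base"; return 1; }
    cat "$out_base.txt"
    rm -f "$out_base" "$out_base.txt"
}
```

For example, `install_traineddata chi_sim.traineddata.gz /path/to/tessdata` followed by `ocr_chinese page.tif chi_sim`, where `page.tif` and `/path/to/tessdata` are placeholders. Writing into a system-wide tessdata directory usually requires root privileges. If the text printed here looks right, the same language code should work in the OCRFeeder engine arguments.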