SSSOC – GY data – modelling xi lian categories

SSSOC niu

The GY data in the SSSOC database currently includes 3883 records in the gy_niu table, representing niu (or “homophone groups”) in the GY. Four of these are peculiar in that they don’t have fanqie spellers.

SELECT graph FROM gy_niu 
WHERE fanqie_1 = '' OR fanqie_2 = ''

The four niu are: 拯, 𪒠, 不 and 丟. The first two reflect exceptional GY niu notation, in which no fanqie spelling is given, but instead a matching pronunciation in a different tone. For example, the entry for 拯 reads:

救也,助也。無韻切。音蒸上聲。五。

To aid, to assist. There is no rhyme-cutting [i.e. fanqie spelling]. Pronounced as the rising tone of 蒸 [a level tone word]. Five [words in this homophone group].

For the latter two, something else is going on. These are among the 9 homophone groups which the RhymeDict project gave a distinctive id > 4000: 攮, 偌, 搇, 攛, 不, 丟, 岔, 韜, 戤. These all appear to be homophone groups that are not in fact represented by the GY but which have been included for some other reason. I don’t know why two have no fanqie spellings, while the rest do. Some of their id numbers also appear as the foreign key value for entries in the gy_entries table. But these entries also seem to be missing from the GY. Given their strange status, we need to remember to handle them cautiously when writing queries, until we understand what they are for.

With only these exceptions, all rows of the gy_niu table have non-empty values in the fanqie_1 and fanqie_2 fields, each consisting of a single (multi-byte) character.

Modelling xi lian fa 系聯法

The xi lian method can be applied using either the first (initial) or the second (final) fanqie speller. Although there are some distinct issues involved in each, a pure xi lian method is the same for either. Here we will assume that the method is being applied using the first speller. For the time being we are considering only the fanqie spellings that the GY uses in its standard formula for each homophone group. There are, of course, other fanqie in the GY used to indicate additional pronunciations of individual characters. We may attempt to incorporate those subsequently, but for now we set them aside. Thus, ideally speaking, since each niu has a fanqie spelling, and since (we assume) each fanqie speller appears in the GY and can therefore be assigned to a niu, the set of fanqie spellings in the GY can be thought of as a directed graph, with niu as nodes and spellings as edges connecting them.

To illustrate, consider 東, the first niu in the GY (id=1 in the SSSOC db). Its first (initial) fanqie speller is 德 (德紅切). The graph 德 belongs to the 德 niu (id=3725). Thus, thinking in terms of the graph, we have an edge from node 1 to node 3725. (We use id numbers for niu since they are unique to the niu, while a single graph may appear in more than one niu.) Continuing, we note that the 德 niu is spelled 多則切, and that 多 is spelled 得何切. Since 得 is in the 德 niu, our directed graph has a cycle.
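
As a sketch of how this graph might be built from the two tables (not the project's actual code): assume gy_niu rows of the form (id, graph, fanqie_1) and gy_entries rows of the form (graph, niu_id), where niu_id stands in for whatever the real foreign-key column to gy_niu is called.

from collections import defaultdict

def build_speller_graph(niu_rows, entry_rows):
    """Return {niu_id: set of niu ids}: each niu points to the niu
    containing its first fanqie speller."""
    # Map each graph (character) to the set of niu that contain it.
    # A single graph can belong to more than one niu, hence a set.
    graph_to_niu = defaultdict(set)
    for graph, niu_id in entry_rows:
        graph_to_niu[graph].add(niu_id)

    edges = defaultdict(set)
    for niu_id, graph, fanqie_1 in niu_rows:
        if not fanqie_1:
            continue                    # the four niu without spellers
        for target in graph_to_niu.get(fanqie_1, ()):
            edges[niu_id].add(target)   # e.g. 1 (東) -> 3725 (德)
    return edges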

Having entered a cycle, we cannot extend the graph any further by this method. However, we can work backwards and ask, “which niu are spelled with (say) 多?”

SELECT graph FROM gy_niu WHERE fanqie_1 = '多'

In addition to 德, which we have already found, there are ten more niu spelled with 多: 端, 董, 等, 亶, 典, 黨, 等, 點, 涷 and 弔 (等 appears twice because two different niu have that head graph). If we were to add these to the graph, there would be an edge from each of them to 多. Both these procedures can be applied recursively to the newly discovered nodes. When they stop yielding new nodes, we will have found something that we might hope to be the set of all GY niu with the same initial type as the syllable with which we seeded the process, namely 東. There are some complications, however.
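
Taken together, the forward and backward procedures amount to collecting everything reachable from the seed when edge direction is ignored. A sketch, continuing from the snippet above (so defaultdict is already imported); backward edges are just the forward edges inverted:

def invert(edges):
    inv = defaultdict(set)
    for src, targets in edges.items():
        for tgt in targets:
            inv[tgt].add(src)
    return inv

def xilian_closure(seed_id, forward, backward):
    # Start from the seed niu (e.g. 1 for 東) and repeatedly follow both
    # "spelled by" and "spells" links until nothing new turns up.
    found = {seed_id}
    frontier = [seed_id]
    while frontier:
        niu_id = frontier.pop()
        neighbours = forward.get(niu_id, set()) | backward.get(niu_id, set())
        for nxt in neighbours:
            if nxt not in found:
                found.add(nxt)
                frontier.append(nxt)
    return found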

The first is the definition of “initial”. Exactly how much of the syllable is indicated by a GY fanqie character? We won’t worry about this for now.

The second complication is that a set of GY niu (syllables) with the same initial may correspond to two or more disjoint graphs generated by the above procedure, graphs which just happen not to have been linked together by an appropriate fanqie spelling, for no linguistically significant reason. We would then need to find some other source of evidence to hook these together.

A third complication concerns the possibility that the GY data contains niu with incorrect fanqie spellings. Such a spelling will link two niu with differing initials, and this in turn will join together entire graphs that correspond to distinct initial categories. There are various ways in which we could protect against this. One way might be to assume that such category-crossing fanqie are much rarer than those that don't cross categories. If that is so, sub-graphs that correspond to distinct initials, if connected to one another at all, are likely to be only weakly connected. Graph-theoretic bridges, for example, might be likely candidates for category-crossing fanqie spellings.
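
As a sketch of how such weak links might be surfaced automatically, the speller graph can be treated as undirected and its bridges listed, here using networkx as an external dependency and the edge structure from the earlier snippets:

import networkx as nx

def candidate_crossings(edges):
    g = nx.Graph()
    for src, targets in edges.items():
        for tgt in targets:
            g.add_edge(src, tgt)
    # A bridge is an edge whose removal disconnects its component. If
    # category-crossing fanqie are rare, these weak links are the first
    # spellings worth checking by hand.
    return list(nx.bridges(g))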

A fourth complication is that of fanqie characters that cannot be found in the GY. For example, the RhymeDict niu data uses 厠 as the fanqie speller for the 㔍 niu. This matches the appearance of the character used for the fanqie in the GY entry for 㔍 in the printed edition we are using. However, there is no entry for 厠 in the GY, only for the variant 廁. This means that an unsupervised xi lian algorithm would fail at this point. Altogether there are 22 niu where this problem occurs, and 11 fanqie initial spellers are implicated.

SELECT fanqie_1, COUNT(fanqie_1)
FROM gy_niu
LEFT JOIN gy_entries
  ON fanqie_1 = gy_entries.graph
WHERE gy_entries.graph IS NULL AND fanqie_1 != ''
GROUP BY fanqie_1

These initial spellers are (with the variants used as head-words in the GY following in parentheses): 厠 (廁), 呂 (吕), 奇 (竒), 妳 (㚷), 姊 (𡛷), 愽 (博 – here the RhymeDict data appears to be in error), 犲 (豺), 疏 (䟽), 禇 (褚), 辿 (the RhymeDict data has this as the initial speller for 顡, but the GY has 他, with the variant reading 五), and 青 (靑).

The solution to this problem is to edit the contents of the gy_niu.fanqie_1 field so that it matches the character that actually occurs in the GY.
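
A minimal sketch of how those edits might be applied in bulk, assuming a DB-API connection (sqlite3 parameter style shown; a MySQL driver would use %s placeholders). The mapping is illustrative and lists only a few of the pairs above; the full set should be checked against the GY head-words, and the 愽 and 辿 cases are corrections rather than simple variant substitutions.

import sqlite3

SPELLER_FIXES = {
    '厠': '廁',   # no GY entry for 厠, only the variant 廁
    '呂': '吕',
    '愽': '博',   # here the RhymeDict speller itself appears to be in error
    # ... remaining pairs from the list above
}

def fix_spellers(conn):
    cur = conn.cursor()
    for old, new in SPELLER_FIXES.items():
        cur.execute("UPDATE gy_niu SET fanqie_1 = ? WHERE fanqie_1 = ?",
                    (new, old))
    conn.commit()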

 

SSSOC – Guang yun data

Data imported from the RhymeDict project. Four tables: entries, niu, initials and finals.

Page number data imported from UniHan database. Single table with graph, page number, and number on page. The page numbers and numbers within each page need to be combined with the entries table from RhymeDict, but this is not a trivial task (one possible matching strategy is sketched after the notes below):

  • graphs do not provide a unique index
  • 25361 rows in entries table vs 25337 from UniHan, a difference of 24. The last 28 rows of the RhymeDict entries (and the corresponding last 9 rows of niu) seem to be some kind of supplement, perhaps for syllables missing in the dictionary.
  • There are three duplicates in the UniHan data, i.e. three pairs of graphs with the same page number and number on page:
    • 曅 and 𬀽 at 540.47
    • 匨 and 𫧔 at 183.9
    • 𦶎 and 𬝨 at 93.48

One of each pair of duplicates was retained (the first of each pair above) and the second deleted.

There were a few other missing graphs, or graphs incorrectly present in one or the other source. These were hand corrected.
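
One possible matching strategy, sketched here only as a starting point and not as the procedure actually used: pair the nth occurrence of each graph in the RhymeDict entries (in GY order) with the nth occurrence of the same graph in the UniHan page list (in page order), and collect whatever fails to pair for hand correction. This assumes the two sources enumerate repeated graphs in the same order, which is exactly what has to be checked by hand.

from collections import defaultdict

def match_by_occurrence(entry_graphs, unihan_rows):
    """entry_graphs: list of graphs in RhymeDict order.
    unihan_rows: list of (graph, page, number_on_page) in page order."""
    pages = defaultdict(list)
    for graph, page, num in unihan_rows:
        pages[graph].append((page, num))

    seen = defaultdict(int)
    matches, unmatched = [], []
    for graph in entry_graphs:
        i = seen[graph]
        if i < len(pages[graph]):
            matches.append((graph, *pages[graph][i]))
        else:
            unmatched.append(graph)      # e.g. the supplementary rows
        seen[graph] += 1
    return matches, unmatched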

Guangyun 廣韻 data

Guang yun text data

The best web app displaying Guang yun data is 韻典網, providing views of the full text, page images, and phonological structure. The site does not provide the raw data from which the views derive, but provides links to sources that do.

The 漢字データベースプロジェクト (Kanji Database Project) provides the complete text of the Guang yun in XML format, including the (unpunctuated) text of individual entries and the preface texts. Fanqie spellings are marked up with XML tags in their original positions in the text. Editorial emendations are also marked up with tags. The text files contain many non-BMP Unicode characters, so importing this data into other formats requires that this be taken into consideration (use of utf8mb4 in MySQL, for example). GPL license.

The most refined Guang yun data, including punctuated and carefully edited text, parallels with the Qie yun 切韻 manuscripts, links to manuscript images, and a sophisticated, nice-looking interface, is provided by this site: Web韻圖. Very impressive, and actively maintained. The underlying data is not made available except via the web interface.

There is a wealth of information on べんぞう(伊藤祥司)展開書庫のページ, but much of it is rather hard to navigate. License not clear.

For the data that underlies the 韻典網 project – a full transcription of the Guang yun text, and phonological analysis – arranged in four CSV files suitable for importing into an RDBMS, the best source is probably the RhymeDict project. GPL license. The project also provides a JavaScript app for browsing the GY locally.

The mapping of graphs to page numbers and entry numbers in the Song printing of the Guang yun is provided by a field in the Unicode UniHan database, as described here.

Guang yun image data

Page images of Guang yun editions are available from a number of sources. The only ones released under an explicitly free license that I am aware of are the Si ku quan shu 四庫全書 editions available through archive.org. These are the 重修廣韻 and the 原本廣韻.

Libraries with publicly available digitized Guang yun editions include:

  • Waseda University – may be downloaded as color images or PDFs
  • Tokyo University – may be downloaded as B&W PDFs
    • 1674 澤存堂 edition (apparently identical page layout to the Waseda 1704 printing.)
  • National Diet Library (viewable online in color, but no downloadable format.)
    • Said to be Song print, though no date given (apparently identical page layout to the Waseda 1704 printing.)
  • Harvard (color page scans, no option for downloading a copy for local use.)
  • IDP manuscripts – use this list for call numbers. Examples include:

 

Huge collection of images of sculpture in Chinese temples

The Zhejiang Library has put online a huge collection of images of (mostly Buddhist) sculpture, under the heading “中国寺庙祠观造像数据库” (a database of statuary in Chinese temples and shrines). There's no proper page navigation, as far as I can tell, nor any search function. Download times are also very slow. Print sources are given for each image, so presumably these are all under copyright.

Insert Zotero bibliographic data into a web-page by querying the Zotero API

  1. Figure out your Zotero user id. This is not the same as your Zotero username. Log into your Zotero account at www.zotero.org. Settings > Feeds/API. Copy the value given by “Your userID for use in API calls is … “.
  2. Construct an API query URL that gives you the bibliographic data you need in the format you want it. Documentation.
    • The URL starts with https://api.zotero.org/users/<userid>/.
    • For a single bibliographic item, add items/<itemkey>. To discover the item key, see the HowTo on this page.
    • Item example: https://api.zotero.org/users/160881/items/8HTNV32W/?v=3&format=bib&style=elsevier-harvard2. 160881 is my user id. 8HTNV32W is the item id in my Zotero database. v=3 selects the most recent API version number (i.e. ver. 3). format=bib requests that the data be formatted as an XHTML bibliography. style=elsevier-harvard2 sets the citation style. I like elsevier-harvard2 because 1/ it doesn’t pointlessly capitalize all title words, and 2/ it doesn’t italicize Chinese (or any) titles. See the result returned by this URL here.
    • Multiple item example: https://api.zotero.org/users/160881/items/?v=3&format=bib&style=elsevier-harvard2&itemKey=UDAKWEPD,6MPC3NRH,BETZ6T8M,2A44Q48D,9KDGT7VK,24MN76X3,IFITWC8S,5JKVSN7X,DSF26IDM,2W76QV2C,983IUQW6,5IBA73W6. Notice that the item keys are now a comma-separated list in the query string. Result is here.
    • Tag example: to return a bibliography of all items with a particular tag use tag= in the query string. E.g. https://api.zotero.org/users/160881/items/?v=3&format=bib&style=elsevier-harvard2&tag=OBI. Result is here. For more complicated tag queries, see the Documentation.
    • COinS: By changing the format=bib to format=coins, a set of <span>s containing COinS data is returned instead. https://api.zotero.org/users/160881/items/?v=3&format=coins&tag=OBI.
  3. Adding the bibliographic data to a PHP/HTML webpage is simple (a Python alternative is sketched after this list):
    <?php 
    $url = 'https://api.zotero.org/users/160881/items/?v=3&format=coins&tag=OBI';
    $var = file_get_contents($url);
    echo $var;
    ?>
    
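For completeness, the same request can be made from Python using only the standard library; the user id, tag and style are the examples used above.

from urllib.request import urlopen

url = ("https://api.zotero.org/users/160881/items/"
       "?v=3&format=bib&style=elsevier-harvard2&tag=OBI")
print(urlopen(url).read().decode("utf-8"))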

Get Zotero item key from Firefox

To read an item via the Zotero API, you need to have the item key. Oddly, the Zotero Firefox app does not provide simple access to the item keys. The following export translator will copy the item keys as a comma-separated list to the clipboard on Ctrl+Shift+c in the usual manner. The translator needs to be saved as a javascript file (.js file extension) in the Zotero translators directory. My translators directory was here ~/.mozilla/firefox/wgcm6dkk.default/zotero/translators/. The code is based on that found here: https://gist.github.com/nschneid/3134386, but pared down to give nothing apart from the list of item keys.

{
"translatorID":"0dbe4ec8-597c-4cc7-bfb5-c38321c5c689",
"translatorType":2,
"label":"Zotero Item Key",
"creator":"Adam Smith",
"target":"html",
"minVersion":"2.0",
"maxVersion":"",
"priority":200,
"inRepository":false,
"displayOptions":{"exportCharset":"UTF-8"},
"lastUpdated":"Fri 22 Aug 2014 12:24:20 PM EDT"
}

// Write the keys of all items being exported as a comma-separated list.
function doExport() {
   var item;
   // First item: write its key with no leading comma.
   if(item = Zotero.nextItem())
   {
      Zotero.write(item.key);
   }
   // Remaining items: prefix each key with a comma.
   while(item = Zotero.nextItem())
   {
      Zotero.write(','+item.key);
   }
}

Fake Gelao 仡佬 writing system and manuscript.

I found a copy of this newly-published book in my mailbox last week.
Jing Tinghu 景亭湖. 2013. Pu Zu Jing: Han Yi Gelao Tianshu 濮祖經:漢譯仡佬天書. Beijing: Wenshi Chubanshe 文史出版社.

It purports to be a publication of a Gelao 仡佬 manuscript book – the Pu Zu Jing 濮祖經 or Classic of the Ancestors of the Pu People – with photographs of the original and a Chinese translation. The manuscript and the script do bear a passing resemblance to similar items produced by other groups in China’s linguistically diverse Southwest. Casual inspection reveals this one to be a transparent attempt at deception, interesting only because of the enthusiastic and credulous reception that it (and a previous discovery of the same kind) received from the Guangming Daily 光明日報. There is no sign that it is intended to be a joke. Here is an English version of the story.

A debunking of another “Gelao manuscript” – the Jiu Tian Da Pu Shi Lu 九天大濮史錄 – appears on this Chinese blog.

The following page appears as p. 165 of the 2013 publication:

[Image: fake_ms_crop]
[Image: fake_ms_xiehewanbang]

協和萬邦

This page, like all the others in the manuscript, is “translated” into Chinese in such a way that each “Gelao” graph corresponds to exactly one Chinese character, the same Chinese character each time (except for a few slips – see below). Word order is preserved. The end result makes sense in Chinese. This alone tells us that this is not a translation. It is also puzzling to find not a single mention of an actual Gelao word in the entire publication.

Furthermore, phrases straight out of Chinese literature emerge by this process. The 4-character phrase on the left, for instance, is translated as 協和萬邦. But notice that second Gelao character: elsewhere in the manuscript, it corresponds to 合 in the translation. And sure enough, the translators have rendered this 4-character phrase as 協合萬邦, not 協和萬邦. Clearly, the composers of the manuscript text were led astray by the homophony (in Mandarin!) of 和 and 合.

A related confusion occurs in the immediately preceding 4-character phrase: the phrase translated as 設立和王 is also written with the character elsewhere used for 合, not the one used for 和.

[Image: fake_ms_hewanggongyi]

和王宮邑

Further on in the text on p. 165, we find the “normal” writing for 和 – the three prongs with circles on the top, as in the phrase 和王宮邑 (image on the right).

[Image: fake_ms_puwanggongdian]

濮王宮殿

Evidently, the manuscript text is in Chinese not in Gelao. The Chinese translation isn’t a translation – it’s the text from which the manuscript was produced.

The “script” is clearly a set of symbols invented so as to be easy to remember by someone who knows the Chinese script. The character corresponding to 宮, for instance, is clearly 宀 over 王. 邑 is immediately recognizable. 濮 (image on left) is just stripped down a little, and 殿 (left) is barely modified at all.

And the content? I haven't had the patience to go through it in detail, but it looks like Chinese dynastic history rehashed, with an extra role for the 僕人 as heroic ancestors of the Gelao. Plus some stuff about the smelting of silver, and cinnabar, and Laozi, and divination.

“Fake, fake, fake, fake.”

Custom fonts for unusual scripts.

The following is a simple method for building a font for an unusual script, using open source software. It could be used to design fonts for scripts that do not have a standard encoding (pre-Han Chinese scripts) or for distinctive varieties of scripts that do have a standard encoding (graph forms found in Chinese calligraphy). The example used here is the script of a 19th c. Nosu manuscript in the Penn Museum (96-17-2).

The method starts with a digital image of the text that provides the glyph exemplars. This needs to be converted to a black-and-white (i.e. 1-bit) image, in which the glyph exemplars are black and everything else white. The outlines of the black areas of the image can then be automatically traced, to provide the outlines for a digital font. These outlines can be imported into a font-editing program, to be modified as necessary, assigned to an encoding, and exported as a font file.

Tools

  1. Linux OS. The following walk-through was carried out on Ubuntu 12.04. All the tools below are simple to download and install under Ubuntu. It may be possible to adapt them to work on MS Windows or a Mac.
  2. GIMP, or some equivalent program for manipulating images.
  3. Glyphtracer, which automates the process of tracing glyph outlines (using potrace under the hood, which must be installed for Glyphtracer to work.)
  4. FontForge, a superb outline font editor.

Procedure

Since our goal here is illustrative, we will make a font consisting of just six glyphs, the six glyphs that appear in what I presume is a title in the top right-hand corner of this page. We will assign the glyphs to the same code points as lower case “a” through “f”. If text containing these six letters is displayed using the font, the glyphs will appear instead of these letters. (In general, this is not a good idea, but since it allows us to type our font using simple key-strokes, it is useful for illustrative purposes.)

1. Create B&W source image with GIMP.

[Image: 96-17-2_6_cropped]

Open up the image file in GIMP. We don't need any of the Chinese text, so we can crop away everything except the top right-hand corner, giving something like the image on the right.

If we were doing this properly, we might want to compensate for the distortion due to the page not being flat when photographed. The GIMP’s “rotate”, “shear” and “scale” functions would probably be adequate for this. But we shan’t bother since the glyphs look good enough.

 

[Image: threshold]

Now we need to convert this color image to a 1-bit black-and-white image. Use the GIMP's “threshold” tool (Tools > Color Tools > Threshold on my machine) to separate the blackish ink of the glyphs from the various shades of brownish paper. The histogram in the threshold tool shows two peaks – one corresponds to the darker ink (smaller left peak) and the other to the lighter brown paper (larger right peak). By dragging the black slider to an appropriate position between the two peaks, we can make almost all of the paper white, and almost all of the ink black. The aim is to preserve the glyph outlines as accurately as possible. Any black mess that comes over from dark patches on the paper can be cleaned up in the next stage.

[Image: 96-17-2_6]

Now we have a color image that only uses two colors (black and white). We need to convert it to a black and white (i.e. 1-bit) image. GIMP does this with Image > Mode > Indexed... > Use black and white (1-bit) palette. Now we can also use the usual GIMP tools to clean up any speckles or other mess that is interfering with the glyph outlines. This should give us something like the image on the left.

Save as a PNG file. We now have an image file that is acceptable as input to Glyphtracer.
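
If many page images need the same treatment, the thresholding and 1-bit conversion can also be scripted, for example with Pillow rather than GIMP. A rough sketch; the filenames follow the ones used here, and the threshold value is a guess that would need tuning per scan.

from PIL import Image

THRESHOLD = 128                                            # tune per scan
img = Image.open("96-17-2_6_cropped.png").convert("L")     # greyscale
bw = img.point(lambda p: 255 if p > THRESHOLD else 0).convert("1")   # 1-bit
bw.save("96-17-2_6.png")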

 

2. Trace glyph outlines with Glyphtracer.

On running Glyphtracer, the first dialog screen allows you to choose the name for the font (anything will do – we’ll call ours nosu), and to select the file path for the input file (browse to that PNG file you saved in the previous step) and the output file (if you use the automatically generated path, it will end up in the same location as the input file).

[Image: glyphtracer1]

Click Start.

The next screen should display the input image, with a bounding box around each glyph. The current glyph is indicated by the text in the button bar at the bottom: Glyph 1/26 a (a). If you click on any glyph in the image, its outline will be assigned to the code point for “a”. This also automatically increments the code point to “b”, reflected in the button bar text. Experiment with the three buttons at the lower left: these change the code point to which a clicked glyph will be assigned. As you assign each glyph in the image, it appears greyed out. (I haven’t found any way to unassign a code point.)

[Image: glyphtracer2]

Code points available for assignment include the Latin alphabet and its common extensions, numerals, common symbols, and Cyrillic. However, it is an easy task to modify the Python code in the file gtlib.py to allow assignment to other ranges, such as CJK for Chinese, or the Private Use Areas.

When all glyphs have been assigned to code points, click Generate SFD file. Glyphtracer will automatically trace the outlines of the glyphs, representing them using Bézier curves. If your input file was nosu.png and you accepted the default options, the output file will be nosu.sfd, in the same directory. The SFD file format contains the numerical data representing the glyph outlines, and the mappings to code points. It is readable by FontForge, which is the next tool we need to use.

3. Make font with FontForge.

Run FontForge. Open the SFD file saved in the previous step. The glyphs will appear in the appropriate positions.

[Image: fontforge1]

Double click a glyph to edit its outline. (We won’t actually do any editing for the purposes of this walk-through, but the glyph edit window is shown below.)

[Image: fontforge2]

Generate the font: File > Generate Fonts.... (I got the “missing points at extrema” and “non-integral coordinates” errors – but these are not lethal, so I saved anyway. Alternatively, fix them in FontForge before saving.) The default is to save as a TrueType font, which will produce a file with a .ttf extension: nosu.ttf.
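
This last step can also be scripted with FontForge's Python bindings, run through FontForge's own interpreter (e.g. fontforge -script make_font.py, where make_font.py is just an illustrative name). A sketch only:

import fontforge

font = fontforge.open("nosu.sfd")
font.selection.all()
font.round()        # round point coordinates to integers ("non-integral coordinates")
font.addExtrema()   # add points at curve extrema ("missing points at extrema")
font.generate("nosu.ttf")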

4. Install font, and try it out.

The font file that I produced is here: http://www.cangjie.info/blog/public_files/nosu.ttf

The font can be installed on any platform, including MS Windows, in the usual way. The screen shot below shows a document produced by typing “abcdef” etc. and then switching the font to nosu.ttf.

[Image: LibreOffice screen shot]

Notes.

  • nosu.ttf is not a fixed-width font, as most East Asian fonts are expected to be. This could easily be changed using FontForge.
  • There are constraints on the arrangement of glyphs in the image to be read by Glyphtracer. I believe it is the case that individual glyphs must be separated by an orthogonal grid of whitespace. I.e. both horizontal and vertical bands of white, running across the entire image, must separate rows and columns of black glyphs.
  • For a similar guide to the use of these tools, see James Wilkes Web Design. It also describes a technique for embedding fonts in web-pages, so content can be viewed without end users having to download and install the relevant fonts.
  • YouTube tutorial by the developer of Glyphtracer.