SSSOC – GY data – modelling xilian categories


The GY data in the SSSOC database currently includes 3883 records in the gy_niu table, representing niu (or “homophone groups”) in the GY. Four of these are peculiar in that they don’t have fanqie spellers.

SELECT graph FROM gy_niu 
WHERE fanqie_1 = '' OR fanqie_2 = ''

The four niu are: 拯, 𪒠, 不 and 丟. The first two reflect an exceptional GY niu notation, in which no fanqie spelling is given, but instead a matching pronunciation in a different tone. For example, the entry for 拯 reads:


To aid, to assist. There is no rhyme-cutting [i.e. fanqie spelling]. Pronounced as the rising tone of 蒸 [a level tone word]. Five [words in this homophone group].

For the latter two, something else is going on. These are among the 9 homophone groups which the RhymeDict project gave a distinctive id > 4000: 攮, 偌, 搇, 攛, 不, 丟, 岔, 韜, 戤. These all appear to be homophone groups that are not in fact represented by the GY but which have been included for some other reason. I don’t know why two have no fanqie spellings, while the rest do. Some of their id numbers also appear as the foreign key value for entries in the gy_entries table. But these entries also seem to be missing from the GY. Given their strange status, we need to remember to handle them cautiously when writing queries, until we understand what they are for.

With only these exceptions, all rows of the gy_niu table have non-empty values in the fanqie_1 and fanqie_2 fields, each consisting of a single (multi-byte) character.
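A check of this kind can also be run outside the database. The following is a minimal sketch in Python, assuming the rows have been fetched as (id, graph, fanqie_1, fanqie_2) tuples. The function name and the id 9000 are invented for illustration. The 東 and 德 rows use the ids and spellings quoted later in this post.

```python
# Toy rows in the shape (id, graph, fanqie_1, fanqie_2).
# Ids 1 and 3725 and their spellings are quoted in this post;
# id 9000 is an invented placeholder for a niu without spellers.
ROWS = [
    (1, "東", "德", "紅"),
    (3725, "德", "多", "則"),
    (9000, "拯", "", ""),
]


def rows_with_odd_spellers(rows):
    """Return the ids of rows whose spellers are not exactly one character each."""
    return [
        niu_id
        for niu_id, _graph, fanqie_1, fanqie_2 in rows
        if len(fanqie_1) != 1 or len(fanqie_2) != 1
    ]


print(rows_with_odd_spellers(ROWS))
```

Python's len counts code points rather than bytes, so a character outside the Basic Multilingual Plane such as 𪒠 still counts as one character.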

Modelling xi lian fa 系聯法

The xi lian method can be applied using either the first (initial) or second (final) fanqie speller. Although there are some distinct issues involved in each, a pure xi lian method is the same for either. Here we will assume that the method is being applied using the first speller. For the time being we are considering only the fanqie spellings that the GY uses in its standard formula for each homophone group. There are, of course, other fanqie in the GY used to indicate additional pronunciations of individual characters. We may attempt to incorporate those subsequently, but for now we set them aside. Thus, ideally speaking, since each niu has a fanqie spelling, and since (we assume) each fanqie speller appears in the GY and can therefore be assigned to a niu, the set of fanqie spellings in the GY can be thought of as a directed graph, with niu as nodes and spellings as edges connecting nodes.

To illustrate, consider 東, the first niu in the GY (id=1 in the SSSOC db). Its first (initial) fanqie speller is 德 (德紅切). The graph 德 belongs to the 德 niu (id=3725). Thus, thinking in terms of the graph, we have an edge from node 1 to node 3725. (We use id numbers for niu since they are unique to the niu, while a single graph may appear in more than one niu.) Continuing, we note that the 德 niu is spelled 多則切, and that 多 is spelled 得何切. Since 得 is in the 德 niu, our directed graph has a cycle.
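The walk just described can be sketched in a few lines of Python. This is only an illustration on toy data, not the SSSOC implementation. The ids 1 and 3725 come from the text, while 9001 is a made-up placeholder id for the 多 niu and the dictionary layout is an assumption. Where a character belongs to more than one niu, the sketch simply takes the first match, which a real implementation would have to handle properly.

```python
# Toy data from the example above. Only ids 1 and 3725 come from the text;
# 9001 is an invented placeholder id for the 多 niu.
NIU = {
    1: {"head": "東", "fanqie_1": "德", "members": {"東"}},
    3725: {"head": "德", "fanqie_1": "多", "members": {"德", "得"}},
    9001: {"head": "多", "fanqie_1": "得", "members": {"多"}},
}


def niu_of(graph):
    """Return the ids of all niu that contain the given character."""
    return [niu_id for niu_id, niu in NIU.items() if graph in niu["members"]]


def follow_spellers(start):
    """Follow first-speller edges from start until a niu is revisited.

    Returns (path, repeated_id). repeated_id is None if the walk stops
    because a speller cannot be found in any niu.
    """
    path = [start]
    seen = {start}
    current = start
    while True:
        targets = niu_of(NIU[current]["fanqie_1"])
        if not targets:
            return path, None
        nxt = targets[0]  # simplification: take the first matching niu
        if nxt in seen:
            return path, nxt
        path.append(nxt)
        seen.add(nxt)
        current = nxt


print(follow_spellers(1))
```

Starting from 東 (id 1), the walk visits 德 and then 多, whose speller 得 leads back into the 德 niu, so the function reports 3725 as the point where the cycle closes.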

Having entered a cycle, we cannot extend the graph any further by this method. However, we can work backwards and ask, “which niu are spelled with (say) 多?”

SELECT graph FROM gy_niu WHERE fanqie_1 = '多'

In addition to 德, which we have already found, the query returns 10 more niu spelled with 多: 端, 董, 等, 亶, 典, 黨, 等, 點, 涷 and 弔. Only 9 distinct graphs appear in this list, because 等 heads two different niu. If we were to add these to the graph, there would be an edge from each of them to the 多 niu. Both procedures, following a niu's speller forwards and finding the niu that use a given speller, can be applied recursively to the newly discovered nodes. When they stop yielding any new nodes, we will have found something that we might hope to be the set of all GY niu with the same initial type as the syllable with which we seeded the process, namely 東. There are some complications, however.
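Applying both procedures until nothing new turns up amounts to collecting everything reachable from the seed when edges are followed in either direction. Here is a minimal Python sketch on toy data. The edges among 東, 德, 多, 端 and 董 are the spellings mentioned in the text, while "X" and "Y" are invented placeholders for an unrelated pair of niu. Nodes are labelled by head graph only for readability. Real code should use niu ids, since one graph can head more than one niu.

```python
from collections import defaultdict, deque

# (niu, niu of its first speller) pairs. "X" and "Y" are invented placeholders.
EDGES = [
    ("東", "德"),
    ("德", "多"),
    ("多", "德"),
    ("端", "多"),
    ("董", "多"),
    ("X", "Y"),
    ("Y", "X"),
]


def linked_set(seed, edges):
    """Return every niu reachable from seed, following edges both ways."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)  # forwards: niu to its speller's niu
        neighbours[target].add(source)  # backwards: speller's niu to users
    found = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for other in neighbours[node]:
            if other not in found:
                found.add(other)
                queue.append(other)
    return found


print(sorted(linked_set("東", EDGES)))
```

Seeding with 東 collects 東, 德, 多, 端 and 董, while the placeholder pair stays in a separate set of its own.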

The first is the definition of “initial”. Exactly how much of the syllable is indicated by a GY fanqie character? We won’t worry about this for now.

The second complication is that a set of GY niu (syllables) with the same initial may correspond to two or more disjoint graphs generated by the above procedure, disjoint graphs which just happen not to have been linked together by an appropriate fanqie spelling, for no linguistically significant reason. We would then need to find some other source of evidence to hook these together.

A third complication concerns the possibility that the GY data contains niu with incorrect fanqie spellings. Such a spelling will link two niu with differing initials, and this in turn will join the entire graphs that correspond to two distinct initial categories. There are various ways in which we could protect against this. One way might be to assume that such category-crossing fanqie are much rarer than those that don’t cross categories. If that is so, sub-graphs that correspond to distinct initials, if connected to one another at all, are likely to be only weakly connected. Graph-theoretic bridges, for example, might be likely candidates for category-crossing fanqie spellings.
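To make the idea of a bridge concrete, here is a sketch of the standard depth-first bridge search (usually credited to Tarjan), treating spellings as undirected edges. The node names are invented: two tightly linked clusters joined by a single spelling. Parallel edges are kept distinct, on the assumption that two niu linked by more than one spelling should not count as bridged.

```python
from collections import defaultdict

# Invented toy graph: clusters a1-a3 and b1-b3, joined only by ("a3", "b1").
EDGES = [
    ("a1", "a2"), ("a2", "a3"), ("a3", "a1"),
    ("b1", "b2"), ("b2", "b3"), ("b3", "b1"),
    ("a3", "b1"),
]


def find_bridges(edges):
    """Return the edges whose removal would disconnect their component."""
    adjacency = defaultdict(list)
    for index, (a, b) in enumerate(edges):
        adjacency[a].append((b, index))
        adjacency[b].append((a, index))

    order = {}  # discovery order of each node
    low = {}    # earliest node reachable without reusing the entry edge
    bridges = []

    def visit(node, via_edge):
        order[node] = low[node] = len(order)
        for other, index in adjacency[node]:
            if index == via_edge:
                continue  # do not walk straight back along the entry edge
            if other in order:
                low[node] = min(low[node], order[other])
            else:
                visit(other, index)
                low[node] = min(low[node], low[other])
                if low[other] > order[node]:
                    bridges.append(edges[index])

    for node in list(adjacency):
        if node not in order:
            visit(node, None)
    return bridges


print(find_bridges(EDGES))
```

The search is recursive, so on a graph with several thousand niu it may hit Python's default recursion limit. An iterative version or a raised limit would be needed there.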

A fourth complication is that of fanqie characters that cannot be found in the GY. For example, the RhymeDict niu data uses 厠 as the fanqie speller for one niu. This matches the character printed in the fanqie of the GY entry for 㔍 in the edition we are using. However, there is no entry for 厠 in the GY, only for the variant 廁. An unsupervised xilian algorithm would therefore fail at this point, because the speller cannot be assigned to any niu. Altogether there are 22 niu where this problem occurs, involving 11 distinct initial spellers.

SELECT gy_niu.fanqie_1, COUNT(gy_niu.fanqie_1) AS niu_count
FROM gy_niu
LEFT JOIN gy_entries
ON gy_niu.fanqie_1 = gy_entries.graph
WHERE gy_entries.graph IS NULL AND gy_niu.fanqie_1 <> ''
GROUP BY gy_niu.fanqie_1

These initial spellers are (with the variants used as head-words in the GY following in parentheses): 厠 (廁), 呂 (吕), 奇 (竒), 妳 (㚷), 姊 (𡛷), 愽 (博 – here the RhymeDict data appears to be in error), 犲 (豺), 疏 (䟽), 禇 (褚), 辿 (the RhymeDict data has this as the initial speller for 顡, but the GY has 他, with the variant reading 五), and 青 (靑).

The solution to this problem is to edit the contents of the gy_niu.fanqie_1 field so that it matches the character that actually occurs in the GY.
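Such an edit could be scripted rather than done by hand. The sketch below uses Python's sqlite3 module with an in-memory table as a stand-in for the real database, whose engine is not stated here. The three replacement pairs are taken from the list above, but the table rows, their ids and head graphs, and the function names are invented for the demonstration.

```python
import sqlite3

# RhymeDict speller -> form used as head-word in the GY (from the list above).
CORRECTIONS = {"呂": "吕", "奇": "竒", "青": "靑"}


def demo_connection():
    """Build an in-memory gy_niu table holding invented toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gy_niu (id INTEGER, graph TEXT, fanqie_1 TEXT)")
    conn.executemany(
        "INSERT INTO gy_niu VALUES (?, ?, ?)",
        [(1, "甲", "呂"), (2, "乙", "青"), (3, "丙", "德")],
    )
    conn.commit()
    return conn


def apply_corrections(conn, corrections):
    """Rewrite fanqie_1 values and return the number of rows changed."""
    before = conn.total_changes
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "UPDATE gy_niu SET fanqie_1 = ? WHERE fanqie_1 = ?",
            [(new, old) for old, new in corrections.items()],
        )
    return conn.total_changes - before


conn = demo_connection()
print(apply_corrections(conn, CORRECTIONS))
```

Keeping the corrections in one mapping leaves a record of which RhymeDict values were altered, which would make the change easier to review or undo later.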