## SSSOC *niu*

The GY data in the SSSOC database currently includes 3883 records in the gy_niu table, representing *niu* (or “homophone groups”) in the GY. Four of these are peculiar in that they don’t have *fanqie* spellers.

SELECT graph FROM gy_niu WHERE fanqie_1 = '' OR fanqie_2 = ''

The four *niu* are: 拯, 𪒠, 不 and 丟. The first two reflect an exceptional GY *niu* notation, in which no *fanqie* spelling is given; instead, the pronunciation is indicated by reference to a matching syllable in a different tone. For example, the entry for 拯 reads:

救也，助也。無韻切。音蒸上聲。五。

To aid, to assist. There is no rhyme-cutting [i.e. fanqie spelling]. Pronounced as the rising tone of 蒸 [a level tone word]. Five [words in this homophone group].

For the latter two, something else is going on. These are among the 9 homophone groups to which the RhymeDict project gave a distinctive id greater than 4000: 攮, 偌, 搇, 攛, 不, 丟, 岔, 韜, 戤. These all appear to be homophone groups that are not in fact found in the GY but which have been included for some other reason. I don’t know why two have no *fanqie* spellings, while the rest do. Some of their id numbers also appear as the foreign key value for entries in the gy_entries table, but these entries also seem to be missing from the GY. Given their strange status, we need to handle them cautiously when writing queries, until we understand what they are for.

With only these exceptions, all rows of the gy_niu table have non-empty values in the fanqie_1 and fanqie_2 fields, each consisting of a single (multi-byte) character.

## Modelling *xi lian fa* 系聯法

The *xi lian* method can be applied using either the first (initial) or second (final) *fanqie* speller. Although there are some distinct issues involved in each, a pure *xi lian* method is the same for either. Here we will assume that the method is being applied using the first speller. For the time being we are considering only the *fanqie* spellings that the GY uses in its standard formula for each homophone group. There are, of course, other *fanqie* in the GY used to indicate additional pronunciations of individual characters. We may attempt to incorporate those subsequently, but for now we set them aside. Thus, ideally speaking, since each *niu* has a *fanqie* spelling, and since (we assume) each *fanqie* speller appears in the GY, and therefore can be assigned to a *niu*, the set of *fanqie* spellings in the GY can be thought of as a directed graph, with *niu* as nodes, and spellings as edges connecting nodes.

To illustrate, consider 東, the first *niu* in the GY (id=1 in the SSSOC db). Its first (initial) *fanqie* speller is 德 (德紅切). The graph 德 belongs to the 德 *niu* (id=3725). Thus, thinking in terms of the graph, we have an edge from node 1 to node 3725. (We use id numbers for *niu* since they are unique to the *niu*, while a single graph may appear in more than one *niu*.) Continuing, we note that the 德 *niu* is spelled 多則切, which gives an edge from node 3725 to the *niu* containing 多, and that 多 is in turn spelled 得何切. Since 得 is in the 德 *niu*, this last edge leads back to node 3725, and our directed graph has a cycle.
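The 東 example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SSSOC implementation: the rows are typed in by hand rather than read from the database, the ids 1 and 3725 come from the text, and 9001 is a made-up placeholder for the *niu* containing 多, whose real id is not given here. The names `NIU`, `MEMBERS`, `build_edges` and `reaches_cycle` are likewise hypothetical.

```python
from collections import defaultdict

# Toy rows: (niu id, head graph, first fanqie speller).
# Ids 1 and 3725 are from the text; 9001 is a hypothetical placeholder.
NIU = [
    (1, "東", "德"),
    (3725, "德", "多"),
    (9001, "多", "得"),
]

# Which niu each graph belongs to. A graph may belong to more than one niu,
# so the values are lists of ids.
MEMBERS = {"東": [1], "德": [3725], "得": [3725], "多": [9001]}


def build_edges(niu_rows, members):
    """Return {niu id: set of niu ids that contain its first speller}."""
    edges = defaultdict(set)
    for niu_id, _graph, speller in niu_rows:
        for target in members.get(speller, []):
            edges[niu_id].add(target)
    return dict(edges)


def reaches_cycle(edges, start):
    """Depth-first walk from start; True if some node on the path recurs."""
    on_path, finished = set(), set()

    def visit(node):
        if node in on_path:
            return True
        if node in finished:
            return False
        on_path.add(node)
        found = any(visit(nxt) for nxt in edges.get(node, ()))
        on_path.discard(node)
        finished.add(node)
        return found

    return visit(start)


edges = build_edges(NIU, MEMBERS)
print(edges)
print(reaches_cycle(edges, 1))
```

Because a speller that belongs to several *niu* yields several outgoing edges, the sketch stores a set of targets per node rather than a single target.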

Having entered a cycle, we cannot extend the graph any further by this method. However, we can work backwards and ask, “which *niu* are spelled with (say) 多?”

SELECT graph FROM gy_niu WHERE fanqie_1 = '多'

In addition to 德, which we have already found, there are 9 more *niu* spelled with 多: 端, 董, 等, 亶, 典, 黨, 等, 點, 涷 and 弔. If we were to add these to the graph, there would be an edge from each of them to the node containing 多. Both these procedures, following a speller forwards and looking up spelled *niu* backwards, can be applied recursively to the newly discovered nodes. When these procedures stop yielding any new nodes, we will have found something that we might hope to be the set of all GY *niu* with the same initial type as the syllable with which we seeded the process, namely 東. There are some complications, however.
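Applying both procedures recursively until nothing new turns up amounts to collecting every node that can be reached from the seed when edges are followed in either direction. A minimal sketch, assuming the edges are already available as a dictionary: for readability the nodes are labelled with head graphs rather than ids (a simplification that real code should avoid, for the reason given above), only a few of the *niu* spelled with 多 are included, and "X" and "Y" are invented placeholders for an unrelated pair of *niu*. The function name `linked_group` is hypothetical.

```python
from collections import deque


def linked_group(edges, seed):
    """Return every node reachable from seed, following edges both ways."""
    # Build an undirected neighbour table so that one search covers both
    # "follow the speller" and "find what this speller spells".
    neighbours = {}
    for src, targets in edges.items():
        for dst in targets:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)

    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for other in neighbours.get(node, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Toy edges taken from the 東 example, plus a placeholder pair X -> Y.
toy_edges = {
    "東": {"德"},
    "德": {"多"},
    "多": {"德"},
    "端": {"多"},
    "董": {"多"},
    "X": {"Y"},
}
group = linked_group(toy_edges, "東")
print(sorted(group))
```

The search stops by itself once the queue is empty, which corresponds to the point where the two procedures stop yielding new nodes.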

The first is the definition of “initial”. Exactly how much of the syllable is indicated by a GY *fanqie* character? We won’t worry about this for now.

The second complication is that a set of GY *niu* (syllables) with the same initial may correspond to two or more disjoint graphs generated by the above procedure, disjoint graphs which just happen not to have been linked together by an appropriate *fanqie* spelling, for no linguistically significant reason. We would then need to find some other source of evidence to hook these together.

A third complication concerns the possibility that in the GY data there are examples of *niu* with incorrect *fanqie* spellings. These will link two *niu* with differing initials. This in turn will join the entire graphs that correspond to distinct initial categories. There are various ways in which we could protect against this. One way might be to assume that such category-crossing *fanqie* are much rarer than those that don’t cross categories. If that is so, sub-graphs that correspond to distinct initials, if connected to one another at all, are likely to be only weakly connected. Graph-theoretic bridges, for example, might be likely candidates for category-crossing *fanqie* spellings.
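To make the idea of bridges concrete, here is a small sketch of the standard depth-first (low-link) method for finding them in an undirected graph. It is an assumption-laden illustration rather than a worked-out procedure: the node names are invented, spelling direction is ignored, parallel edges between the same two nodes are collapsed into one, and the recursive search would need an iterative rewrite (or a raised recursion limit) for a graph of several thousand nodes.

```python
def find_bridges(undirected_edges):
    """Return the edges whose removal would disconnect part of the graph."""
    adj = {}
    for a, b in undirected_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    order, low, bridges = {}, {}, []

    def dfs(node, parent):
        # order: discovery index; low: earliest index reachable from the
        # subtree below node without reusing the edge to its parent.
        order[node] = low[node] = len(order)
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt in order:
                low[node] = min(low[node], order[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > order[node]:
                    bridges.append((node, nxt))

    for node in adj:
        if node not in order:
            dfs(node, None)
    return bridges


# Two invented clusters of three nodes each, joined by a single edge.
toy = [
    ("a1", "a2"), ("a2", "a3"), ("a3", "a1"),
    ("b1", "b2"), ("b2", "b3"), ("b3", "b1"),
    ("a3", "b1"),
]
bridges = find_bridges(toy)
print(bridges)
```

One caveat for this data: since every *niu* has exactly one first speller, any *niu* that is never itself used as a speller hangs off the graph by a bridge. Presumably only bridges with a sizeable group of nodes on both sides would be worth inspecting as possible category-crossing spellings.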

A fourth complication is that of *fanqie* characters that cannot be found in the GY. For example, the RhymeDict *niu* data uses 厠 as the *fanqie* speller for the 㔍 *niu*. This matches the appearance of the character used for the *fanqie* in the GY entry for 㔍 in the printed edition we are using. However, there is no entry for 厠 in the GY, only for the variant 廁. This means that an unsupervised *xi lian* algorithm would fail at this point. Altogether there are 22 *niu* where this problem occurs, and 11 *fanqie* initial spellers are implicated.

SELECT fanqie_1, COUNT(fanqie_1) FROM gy_niu LEFT JOIN gy_entries ON fanqie_1 = gy_entries.graph WHERE gy_entries.graph IS NULL AND fanqie_1 <> '' GROUP BY fanqie_1

These initial spellers are (with the variants used as head-words in the GY following in parentheses): 厠 (廁), 呂 (吕), 奇 (竒), 妳 (㚷), 姊 (𡛷), 愽 (博; here the RhymeDict data appears to be in error), 犲 (豺), 疏 (䟽), 禇 (褚), 辿 (the RhymeDict data has this as the initial speller for 顡, but the GY has 他, with the variant reading 五), and 青 (靑).

The solution to this problem is to edit the contents of the gy_niu.fanqie_1 field so that it matches the character that actually occurs in the GY.
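One way the edit might be scripted is sketched below, using Python's built-in sqlite3 module against a throwaway in-memory table. The correction pairs are copied from the list of spellers and GY head-word variants given above, with 辿 mapped to 他 as the GY reading. Everything else is an assumption made for illustration: the database engine, the reduced three-column table, the id given to the 㔍 row, and the names `CORRECTIONS` and `normalise_first_spellers`.

```python
import sqlite3

# RhymeDict speller -> form that occurs as a head-word in the GY.
CORRECTIONS = {
    "厠": "廁", "呂": "吕", "奇": "竒", "妳": "㚷", "姊": "𡛷", "愽": "博",
    "犲": "豺", "疏": "䟽", "禇": "褚", "辿": "他", "青": "靑",
}


def normalise_first_spellers(conn, corrections):
    """Rewrite gy_niu.fanqie_1 in place; return the number of rows changed."""
    changed = 0
    for old, new in corrections.items():
        cur = conn.execute(
            "UPDATE gy_niu SET fanqie_1 = ? WHERE fanqie_1 = ?", (new, old)
        )
        changed += cur.rowcount
    conn.commit()
    return changed


# Throwaway demonstration table; the real gy_niu table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gy_niu (id INTEGER PRIMARY KEY, graph TEXT, fanqie_1 TEXT)")
conn.executemany(
    "INSERT INTO gy_niu VALUES (?, ?, ?)",
    [(1, "東", "德"), (2, "㔍", "厠")],  # id 2 is a hypothetical placeholder
)
changed = normalise_first_spellers(conn, CORRECTIONS)
fixed = conn.execute("SELECT fanqie_1 FROM gy_niu WHERE graph = '㔍'").fetchone()[0]
print(changed, fixed)
```

Keeping the corrections in a single table of pairs, rather than as hand-typed UPDATE statements, would leave a record of exactly which RhymeDict values were altered. It would probably be wise to run this on a copy of the database first.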