88889 as of Unicode 11, minus indecomposable characters. Note: New items will need to be created for components that are not yet encoded in Unicode. Examples of such components can be found in IRG Working Set 2015 [1].
Comment @GZWDer: Ideographic description sequences are useful for describing the glyphs of Han characters. However, I think that in Wikidata we should store this kind of data as items (with the qualifier series ordinal (P1545)) rather than as strings. By doing so, we can easily generate the composition trees of Han characters. What do you think? --Okkn (talk) 18:06, 16 July 2018 (UTC)
@GZWDer: I know that. But the sequence for "箱", for example, is "⿱𥫗相", not "⿱𥫗(⿰木目)", right? Or should both sequences be stored in this property? --Okkn (talk) 18:17, 16 July 2018 (UTC)
A character may have more than one ideographic description sequence, but the shortest possible sequence is preferred. Note that the Unicode Standard does not define equivalence for two Ideographic Description Sequences that are not identical. See page 424 of [2]. --GZWDer (talk) 18:23, 16 July 2018 (UTC)
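As an illustration of why a short sequence and its fully expanded form are different strings, here is a minimal Python sketch that parses an IDS written in prefix notation. It assumes the twelve ideographic description characters U+2FF0..U+2FFB available as of Unicode 11; the function name `parse_ids` is hypothetical and not part of any existing tool.

```python
# Ideographic Description Characters (IDCs) as of Unicode 11: U+2FF0..U+2FFB.
# U+2FF2 (⿲) and U+2FF3 (⿳) take three operands; the others take two.
TERNARY = {"\u2ff2", "\u2ff3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple (operator, operand, ...).

    Hypothetical helper for illustration only.
    """
    def parse(pos):
        ch = ids[pos]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            node = [ch]
            pos += 1
            for _ in range(arity):
                child, pos = parse(pos)
                node.append(child)
            return tuple(node), pos
        # Any other character is treated as a leaf component.
        return ch, pos + 1

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree
```

With this sketch, "⿱𥫗相" parses to a flat node with two leaves, while "⿱𥫗⿰木目" parses to a node whose second operand is itself a node, so the two sequences are not identical even though they describe the same character.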
I know IDSes. OK, I also think we should store only the shortest one. In that case, shouldn't we link from "箱" to "相" or from "相" to "木"? I want to draw a tree like Figure 1 on page 131 of [3]. --Okkn (talk) 18:33, 16 July 2018 (UTC)
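The idea of storing only the shortest sequence per character and following the links to build a tree can be sketched as follows. This is a minimal Python illustration, not a Wikidata implementation: the `COMPONENTS` mapping is hypothetical sample data standing in for item-valued statements, with list order playing the role of the series ordinal.

```python
# Hypothetical sample data: each character maps to the ordered components
# of its shortest IDS (list order stands in for the series ordinal).
COMPONENTS = {
    "箱": ["𥫗", "相"],
    "相": ["木", "目"],
}


def composition_tree(char, components=COMPONENTS):
    """Recursively expand a character into a nested composition tree."""
    parts = components.get(char, [])
    return {
        "char": char,
        "parts": [composition_tree(p, components) for p in parts],
    }


def render(tree, depth=0):
    """Return the tree as a list of indented lines, one per component."""
    lines = ["  " * depth + tree["char"]]
    for part in tree["parts"]:
        lines.extend(render(part, depth + 1))
    return lines


print("\n".join(render(composition_tree("箱"))))
```

Only one level is stored for "箱", yet the full tree down to "木" and "目" is recovered by following the link through "相".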
Then we will create a lot of items about Han character components. (Yes, [4] includes unencoded components.) How should we deal with cases like ⿰氵每 (China, Hong Kong, Taiwan, South Korea, Vietnam); ⿰氵毎 (Japan)? --GZWDer (talk) 21:57, 16 July 2018 (UTC)
If we import all Han characters encoded in Unicode 11, the number of components is much lower than that, so I don't think it is a particular problem. Also, in some environments, even the encoded components can't be displayed correctly, so using items brings considerable benefits to everyone.
Oppose (changed to Support) There are many mistakes in external lists such as [5]. This information is readily available on English Wiktionary, but many errors exist due to mass copying without proper checking. The use of ideographic description character (Q55589899) is more appropriate for characters that are not yet encoded in Unicode. As for characters that have already been encoded, character forms or glyph shapes are better depicted in graphical form, e.g. as images or by referring to GlyphWiki ID (P5467). Also, not all characters can be represented by the proposed property, e.g. 慶 (the inner glyph cannot be decomposed). In addition, there are many inconsistencies in Unicode encoding. Take a look at the derived characters of 兹 as an example. Also, in some cases this data is better described in textual form, to be understood by humans, than in the form of structured data. See 凞 as an example. KevinUp (talk)
I'm aware that this data is especially useful for generating composition trees of characters but I regret to say that there are too many inconsistencies in Unicode, especially for archaic and rarely used characters. This sort of data needs to be hand sorted to be meaningful and free from error. Information on composition trees can be partially obtained from English Wiktionary by referring to the section on "derived characters". See 𦭝 as an example. KevinUp (talk) 14:51, 20 July 2018 (UTC)
Also, the choice of which glyph to use to represent a character can be highly subjective due to ambiguous forms. See 頋 (formed by minor variations between 㔾/巳) as an example. Also, note that we are unable to create separate items for CJK Compatibility Ideographs (Q2493848). Perhaps you could provide an example based on 殺 to illustrate the use of this property. KevinUp (talk) 14:51, 20 July 2018 (UTC)
@KevinUp: I can understand your concern, but without this property, we cannot link characters that can be clearly represented by this property (e.g. 相 → 木 and 目). We are not obliged to decompose all of the Han characters in Wikidata. A property is just a medium of expression, and I think linking items is the most important thing in graph databases. Even if it contains mistakes or inconsistent values, it is more useful than having no links. In addition, it may be easier to find errors in linked data than in unstructured text. --Okkn (talk) 15:51, 20 July 2018 (UTC)
@Okkn: I think I will support this proposal under two conditions: (1) No "robot and gadget jobs" involved. All values have to be entered manually. (2) Impose a limitation, i.e. this property can only be used for characters that have only one unique composition. This can be done by renaming the property to something like "unique IDS". This way, simple characters such as 相 → 木 and 目 can be decomposed, while complex Han characters that have more than one IDS are not allowed to use this property. I have added the link to [7], which has fewer errors compared to [8]. KevinUp (talk) 16:13, 20 July 2018 (UTC)
As for the first point, I don't plan to conduct robot jobs related to this property, and I also think we should not import IDSes automatically from the sources above. (But it is technically impossible to prevent other people from doing so...) As for the second, I think we can allow multiple IDSes only when the character has CJKV variants (like the above example of "海"). --Okkn (talk) 17:28, 20 July 2018 (UTC)
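A minimal Python sketch of how multiple IDSes for one character might be kept apart by a jurisdiction qualifier, using the "海" example from this discussion. The `IdsStatement` class, the `STATEMENTS` data and the `ids_for` helper are all hypothetical, and treating an empty jurisdiction set as "applies everywhere" is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdsStatement:
    """Hypothetical stand-in for one IDS statement with a jurisdiction qualifier."""
    ids: str
    jurisdictions: frozenset  # assumption: empty set means no restriction


# Sample data taken from the examples in this discussion.
STATEMENTS = {
    "海": [
        IdsStatement("⿰氵每", frozenset(
            {"China", "Hong Kong", "Taiwan", "South Korea", "Vietnam"})),
        IdsStatement("⿰氵毎", frozenset({"Japan"})),
    ],
    "相": [IdsStatement("⿰木目", frozenset())],
}


def ids_for(char, jurisdiction, statements=STATEMENTS):
    """Return the IDSes of `char` that apply in the given jurisdiction."""
    return [
        s.ids
        for s in statements.get(char, [])
        if not s.jurisdictions or jurisdiction in s.jurisdictions
    ]
```

Under these assumptions, `ids_for("海", "Japan")` selects only the sequence containing 毎, while a character with a single unqualified IDS such as "相" returns the same value for every jurisdiction.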
@KevinUp: Note that I said "minus indecomposable characters" in the property proposal. As for 凞, I have proposed two layouts (warning: collapsed by default as it is very long).
I think this property is useful even for characters that have already been encoded as it provides a structured description of character and it is easy to query for derived characters from its part.--GZWDer (talk) 20:58, 20 July 2018 (UTC)
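The kind of query meant here, finding derived characters from a part, can be sketched in Python with a reverse index over the component links. The `COMPONENTS` mapping is hypothetical sample data, and the helper names are illustrative; an actual query against Wikidata would depend on the property once it exists.

```python
from collections import defaultdict

# Hypothetical sample data: character -> ordered components of its IDS.
COMPONENTS = {
    "箱": ["𥫗", "相"],
    "相": ["木", "目"],
    "想": ["相", "心"],
}


def build_reverse_index(components):
    """Map each component to the set of characters that use it directly."""
    index = defaultdict(set)
    for char, parts in components.items():
        for part in parts:
            index[part].add(char)
    return index


def derived_characters(part, index):
    """All characters containing `part` directly or via nested components."""
    found = set()
    stack = [part]
    while stack:
        current = stack.pop()
        for parent in index.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

With the sample data, looking up "木" reaches "相" directly and then "箱" and "想" through "相", which is the traversal that plain strings would make harder.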
I wonder whether there are cases where we need to separate shinjitai and kyujitai in IDSes. But using only the applies to jurisdiction (P1001) qualifier as a separator may be reasonable.
From the point of view of Japanese, it is very strange that Katakana (Q2493938) (ヨ) can be part of Han characters. I prefer creating new “Chinese character component” items to using extraneous symbols. --Okkn (talk) 13:17, 21 July 2018 (UTC)
Yes, it will be better to create new items to represent unencoded parts rather than using alternative shortcuts such as Katakana (Q2493938). No, it is not common to use katakana as part of IDSes. I just realized that the typographic style used for katakana in Ming (Q1071487) fonts in Japan more closely resembles 宋朝体 (commonly used in educational materials) than 明朝体. Because most computer systems use ゴシック体/丸ゴシック体 rather than 明朝体 in web browsers, I had assumed that the typography of katakana would be identical to the typography used for Han characters in 明朝体. I now realize that they are different. Thanks for pointing that out to me. On the other hand, usage of CJK Strokes (Q2493860) in IDSes is much more common, as some components such as ㇉ are not part of CJK unified ideograph (Q796156) (note that ㇉ has no meaning on its own). KevinUp (talk) 16:47, 21 July 2018 (UTC)
Perhaps an additional qualifier might be needed to indicate that 雨 (Q3595028) used in example 6 can be further decomposed based on the jurisdiction where the glyph is used. KevinUp (talk) 13:50, 3 August 2018 (UTC)