Wikidata:Property proposal/Ideographic description sequences


Ideographic description sequences

Originally proposed at Wikidata:Property proposal/Generic

Description: method to describe composition of Han characters using ideographic description characters and character components
Data type: Item
Template parameter: "ids" in wikt:template:Han char
Domain: CJK characters
Allowed values: minimum 3 items; must include instances of ideographic description character (Q55589899) and two or more of sinogram (Q53764738), unencoded Chinese character (Q11093293) or CJK Strokes (Q2493860), with mandatory qualifier for all items: series ordinal (P1545)
Example 1: (Q54874870) -> ① (Q55589918) (series ordinal (P1545): 1) ② (Q3594983) (series ordinal (P1545): 2) ③ (Q54552546) (series ordinal (P1545): 3)
Example 2: (Q54875038) -> ① (Q55589919) (series ordinal (P1545): 1) ② 𥫗 (Q55885207) (series ordinal (P1545): 2) ③ (Q54874870) (series ordinal (P1545): 3)
Example 3: (Q3594965) -> ① (Q55589919) (series ordinal (P1545): 1) ② (Q55589921) (series ordinal (P1545): 2) ③ (Q55890554) (series ordinal (P1545): 3) ④ (Q55906699) (series ordinal (P1545): 4) ⑤ (Q55935376) (series ordinal (P1545): 5) ⑥ (Q55851744) (series ordinal (P1545): 6)
Example 4: (Q3594998) -> ① (Q55589918) (series ordinal (P1545): 1; applies to jurisdiction (P1001): mainland China (Q19188), Hong Kong (Q8646), Taiwan (Q865), South Korea (Q884), Vietnam (Q881)) ② (Q55885542) (series ordinal (P1545): 2; applies to jurisdiction (P1001): mainland China (Q19188), Hong Kong (Q8646), Taiwan (Q865), South Korea (Q884), Vietnam (Q881)) ③ (Q55868498) (series ordinal (P1545): 3; applies to jurisdiction (P1001): mainland China (Q19188), Hong Kong (Q8646), Taiwan (Q865), South Korea (Q884), Vietnam (Q881))
Example 5: (Q3594998) -> ① (Q55589918) (series ordinal (P1545): 1; applies to jurisdiction (P1001): Japan (Q17)) ② (Q55885542) (series ordinal (P1545): 2; applies to jurisdiction (P1001): Japan (Q17)) ③ (Q54553315) (series ordinal (P1545): 3; applies to jurisdiction (P1001): Japan (Q17))
Example 6: (Q3595029) -> ① (Q55589919) (series ordinal (P1545): 1; applies to jurisdiction (P1001): Taiwan (Q865), Hong Kong (Q8646), South Korea (Q884)) ② (Q3595028) (series ordinal (P1545): 2; applies to jurisdiction (P1001): Taiwan (Q865), Hong Kong (Q8646), South Korea (Q884)) ③ CJK Radical Snout Two (⺕) (Q55917365) (series ordinal (P1545): 3; applies to jurisdiction (P1001): Taiwan (Q865), Hong Kong (Q8646), South Korea (Q884))
Example 7: (Q3595029) -> ① (Q55589919) (series ordinal (P1545): 1; applies to jurisdiction (P1001): Japan (Q17), mainland China (Q19188)) ② (Q3595028) (series ordinal (P1545): 2; applies to jurisdiction (P1001): Japan (Q17), mainland China (Q19188)) ③ (Q55917460) (series ordinal (P1545): 3; applies to jurisdiction (P1001): Japan (Q17), mainland China (Q19188))
Source: http://www.chise.org/ids-find?components=$i ($i=character to be queried), https://github.com/cjkvi/cjkvi-ids, http://www.babelstone.co.uk/CJK/IDS.TXT
Number of IDs in source: 88889 as of Unicode 11, minus indecomposable characters. Note: New items will need to be created for components that are not yet encoded in Unicode. Examples of such components can be found in IRG Working Set 2015 [1]
Expected completeness: eventually complete (Q21873974)
Robot and gadget jobs: No
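
The "Allowed values" rule in the table above can be read as a check on one set of values. The following is only a minimal sketch in Python, assuming a hypothetical in-memory representation (the `IdsValue` class, its `kind` labels and the function name are made up for illustration); it is not how Wikidata's own constraint system works. Where an item carries several sets distinguished by jurisdiction, as in examples 4 to 7, the check would apply to each set separately.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the classes named in the proposal.
IDC = "ideographic description character"   # Q55589899
COMPONENT_KINDS = {
    "sinogram",                      # Q53764738
    "unencoded Chinese character",   # Q11093293
    "CJK Strokes",                   # Q2493860
}

@dataclass
class IdsValue:
    item: str                 # item ID of the value, e.g. "Q55589918"
    kind: str                 # class the item is assumed to be an instance of
    ordinal: Optional[int]    # series ordinal (P1545) qualifier, if present

def check_allowed_values(values: List[IdsValue]) -> List[str]:
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    if len(values) < 3:
        problems.append("fewer than 3 items")
    if not any(v.kind == IDC for v in values):
        problems.append("no ideographic description character")
    if sum(v.kind in COMPONENT_KINDS for v in values) < 2:
        problems.append("fewer than 2 components")
    if any(v.ordinal is None for v in values):
        problems.append("missing series ordinal")
    return problems
```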

Motivation

This is a way to represent characters, especially unencoded characters like 𱁬 (Q7676480). Note we need a qualifier for example 3. GZWDer (talk) 17:48, 16 July 2018 (UTC)[reply]

Discussion

For (Q54875038): ① (Q55589919) (series ordinal (P1545): 1) ② 𥫗 (Q55885207) (series ordinal (P1545): 2) ③ (Q54874870) (series ordinal (P1545): 3)

For (Q54874870): ① (Q55589918) (series ordinal (P1545): 1) ② (Q3594983) (series ordinal (P1545): 2) ③ (Q54552546) (series ordinal (P1545): 3)
Then we will create a lot of items about Han character components. (Yes, [4] includes unencoded components.) How to deal with things like ⿰氵每 (China, Hong Kong, Taiwan, South Korea, Vietnam); ⿰氵毎 (Japan)?--GZWDer (talk) 21:57, 16 July 2018 (UTC)[reply]
If we import all Han characters encoded in Unicode 11, the number of components is much lower than that. I think that it's not particularly a problem. Also, in some environments, even the encoded components can't be displayed correctly. So using items brings considerable benefits to everyone.
Variations can be represented as follows:
IDSes
  • series ordinal 1
  • series ordinal 2
  • series ordinal 3; writing system: traditional Chinese characters, simplified Chinese characters, kyūjitai, Hanja, etc...
  • series ordinal 3; writing system: shinjitai
I'm not sure whether the common parts ( and 氵) should be duplicated or not. --Okkn (talk) 05:55, 17 July 2018 (UTC)[reply]
The common parts should be listed again to make things clearer. An additional qualifier applies to jurisdiction (P1001) or writing system (P282) will be used to differentiate between the two sets of data. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
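
To illustrate the approach of repeating the common parts for each variant, here is a rough Python sketch that rebuilds one sequence per jurisdiction from qualified values. The flat tuple format and the function name are assumptions made for this example only; the sample data is the ⿰氵每 / ⿰氵毎 case raised above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (component, series ordinal, jurisdictions): a hypothetical flat form of
# the statements, with the common parts repeated for each variant.
Statement = Tuple[str, int, Tuple[str, ...]]

def ids_by_jurisdiction(statements: List[Statement]) -> Dict[str, str]:
    """Rebuild one IDS string per jurisdiction from qualified statements."""
    grouped = defaultdict(list)
    for component, ordinal, jurisdictions in statements:
        for jurisdiction in jurisdictions:
            grouped[jurisdiction].append((ordinal, component))
    # Sort each group by series ordinal, then join the components in order.
    return {
        jurisdiction: "".join(c for _, c in sorted(parts))
        for jurisdiction, parts in grouped.items()
    }

others = ("mainland China", "Hong Kong", "Taiwan", "South Korea", "Vietnam")
japan = ("Japan",)
statements = [
    ("⿰", 1, others), ("氵", 2, others), ("每", 3, others),
    ("⿰", 1, japan), ("氵", 2, japan), ("毎", 3, japan),
]
result = ids_by_jurisdiction(statements)
```

Because every value carries its own jurisdiction qualifier, no value has to be shared between the two sets, which is what makes the grouping step this simple.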
  • @GZWDer: Could you please change the datatype? Then I will support it. --Okkn (talk) 17:30, 19 July 2018 (UTC)[reply]
  • Changed. GZWDer (talk) 17:35, 19 July 2018 (UTC)[reply]
  •  Oppose (changed to  Support) There are many mistakes in external lists such as [5]. This information is readily available on English Wiktionary but many errors exist due to mass copying without proper checking. The use of ideographic description character (Q55589899) is more appropriate for characters that are not yet encoded in Unicode. As for characters that have already been encoded, character forms or glyph shapes are better depicted in graphical form, eg. as images or by referring to GlyphWiki ID (P5467). Also, not all characters can be represented by the proposed property, eg. (inner glyph cannot be decomposed). In addition, there are many inconsistencies in Unicode encoding. Take a look at the derived characters of as an example. Also, in some cases this data is better described in textual form rather than in the form of structured data to be understood by humans. See as an example. KevinUp (talk)
    • I'm aware that this data is especially useful for generating composition trees of characters but I regret to say that there are too many inconsistencies in Unicode, especially for archaic and rarely used characters. This sort of data needs to be hand sorted to be meaningful and free from error. Information on composition trees can be partially obtained from English Wiktionary by referring to the section on "derived characters". See 𦭝 as an example. KevinUp (talk) 14:51, 20 July 2018 (UTC)[reply]
    • Also, the selection of which glyph to be used to represent a character can be highly subjective due to ambiguous forms. See (formed by minor variations between 㔾/巳) as an example. Also, note that we are unable to create separate items for CJK Compatibility Ideographs (Q2493848). Perhaps you could provide an example based on to illustrate the use of this property. KevinUp (talk) 14:51, 20 July 2018 (UTC)[reply]
      • @KevinUp: I can understand your concern, but without this property, we cannot link characters that can be clearly represented by this property (ex. 相 → 木 and 目). We are not obliged to decompose all of the Han characters in Wikidata. A property is just a medium for the expression, and I think linking items is the most important thing in graph databases. Even if it contains mistakes or inconsistent values, it is more useful than no links. In addition, it may be easier to find errors in linked data than in unstructured text. --Okkn (talk) 15:51, 20 July 2018 (UTC)[reply]
        • @Okkn: I think I will support this proposal under two conditions: (1) No "robot and gadget jobs" involved. All values have to be entered manually. (2) Impose a limitation, i.e. this property can only be used for characters that have only one unique composition. This can be done by renaming the property as something like "unique IDS". This way, simple characters such as 相 → 木 and 目 can be decomposed while complex Han characters that have more than one IDS are not allowed to use this property. I have added the link to [7] which has fewer errors compared to [8]. KevinUp (talk) 16:13, 20 July 2018 (UTC)[reply]
          • As for the first point, I don't plan to conduct robot jobs related to this property and I also think we should not import IDSes automatically from the sources above. (But it is technically impossible to prevent other people from doing so...) As for the second, I think we can allow multiple IDSes only when the character has CJKV variants (like the above example of "海"). --Okkn (talk) 17:28, 20 July 2018 (UTC)[reply]
@KevinUp: Note that I said "minus indecomposable characters" in the property proposal. As for 凞, I have proposed two layouts (warning: collapsed by default as it is very long)

I think this property is useful even for characters that have already been encoded as it provides a structured description of character and it is easy to query for derived characters from its part.--GZWDer (talk) 20:58, 20 July 2018 (UTC)[reply]
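
The "query for derived characters from its part" idea can be pictured as an inverted index. The sketch below is plain Python over a hand-written table, not a query against Wikidata; the table format and function name are hypothetical, and the two entries reuse decompositions already mentioned in this discussion.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

def derived_characters(ids_table: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert character -> IDS values into value -> characters using it."""
    index = defaultdict(set)
    for character, components in ids_table.items():
        for component in components:
            index[component].add(character)
    return dict(index)

# Illustrative data only: each list is the IDS in series-ordinal order.
ids_table = {
    "相": ["⿰", "木", "目"],
    "海": ["⿰", "氵", "每"],
}
index = derived_characters(ids_table)
```

Note that this simple version also indexes the description character itself, so looking up ⿰ returns every character with a left-right layout.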

@GZWDer: Good job. I prefer Layout 1 (consistent with Okkn's proposal) as it has a better structure. One more thing to take note of when this property is created: Please apply single-value constraint (Q19474404) of "apply to code point" (uncreated property) and "applies to jurisdiction (P1001)" with allowed values confined to mainland China (Q19188), Hong Kong (Q8646), Taiwan (Q865), Japan (Q17), South Korea (Q884) and Vietnam (Q881) (corresponding to GHTJKV on the Unicode charts). The qualifier "writing system (P282)" is unsuitable for IDS as there is no clear guideline. is neither traditional nor simplified. Traditional Chinese characters (Q178528) are also used by mainland China (Q19188) in the encoding of historical writings. In some cases the term standard hanzi character (Q8044489) is preferred over Simplified Chinese (Q13414913). See and its derived characters for example.  Support for now. KevinUp (talk) 05:02, 21 July 2018 (UTC)[reply]
@Okkn: In some cases, Unicode characters that are not part of CJK unified ideograph (Q796156) such as CJK Strokes (Q2493860), Geometric Shapes (Q750114), Katakana (Q2493938) and Bopomofo (Q198269) may be used in IDS for better representation. See , , , for example. Should we include these into the allowed values? Note that Geometric Shapes (Q750114) may be used in some Hanja (Q485619) characters such as KevinUp (talk) 06:39, 21 July 2018 (UTC)[reply]
I wonder whether there are cases where we need to separate shinjitai and kyujitai on IDSes. But using only applies to jurisdiction (P1001) as a separator may be reasonable.
From the point of view of Japanese, it is very strange that Katakana (Q2493938) (ヨ) can be the parts of Han characters. I prefer creating new “Chinese character component” items to using extraneous symbols. --Okkn (talk) 13:17, 21 July 2018 (UTC)[reply]
For shinjitai and kyujitai, these characters have been encoded with separate code points in Unicode (some as part of CJK Compatibility Ideographs (Q2493848)), so after the new property "apply to code point" (as suggested in Layout 1) has been created, we can use this property as a second single-value constraint (Q19474404). As for using Katakana (Q2493938) as part of IDS, this is probably only a temporary measure, as quite a few Katakana (Q2493938) have been encoded as part of CJK unified ideograph (Q796156). See 𫡏 (ケ), 𠂇 (ナ) , (メ) for example. Currently, and 𬼖 exist as part of CJK unified ideograph (Q796156) but their mirror images コ and ヨ are not encoded yet so using Katakana in IDS is just a temporary measure. For example, we can use コ in . KevinUp (talk) 14:22, 21 July 2018 (UTC)[reply]
In Wikidata, we don't have to use only encoded characters; we can also use items representing unencoded parts. Is it very common to use katakana as a part of IDSes? In fact, the shape of katakana in Ming (Q1071487) fonts in Japan is really different from that of Han characters (see http://en.glyphwiki.org/wiki/u30e8 and http://en.glyphwiki.org/wiki/koseki-112010). --Okkn (talk) 15:36, 21 July 2018 (UTC)[reply]
Yes, it will be better to create new items to represent unencoded parts rather than using alternative shortcuts such as Katakana (Q2493938). No, it is not common to use katakana as part of IDSes. I just realized that the typographic style used for katakana in Ming (Q1071487) fonts in Japan has a closer resemblance to 宋朝体 (commonly used in educational materials) rather than 明朝体. Because most computer systems use ゴシック体/丸ゴシック体 rather than 明朝体 on web browsers, I had assumed that the typography of katakana would be identical with the typography used by Han characters in 明朝体. I now realize that they are different. Thanks for pointing that out to me. On the other hand, usage of CJK Strokes (Q2493860) in IDSes is much more common as some components such as is not part of CJK unified ideograph (Q796156) (Note that has no meaning on its own). KevinUp (talk) 16:47, 21 July 2018 (UTC)[reply]
By the way, a list of commonly found components that are not yet encoded, eg. the left component of (Q54912005) can be found at the beginning of [9]. KevinUp (talk) 03:40, 22 July 2018 (UTC)[reply]
@GZWDer: What do you think? Would it be appropriate to create new items to represent unencoded parts or shall we use alternatives such as CJK Strokes (Q2493860), Geometric Shapes (Q750114) and Bopomofo (Q198269) for a better representation of Han characters? KevinUp (talk) 16:47, 21 July 2018 (UTC)[reply]
Upon further examination I noticed that used in example 1 is actually part of CJK Radicals Supplement (Q2493859) and is commonly used with CJK Strokes (Q2493860) and CJK unified ideograph (Q796156) in external IDSes such as [10] and [11]. However, Geometric Shapes (Q750114), Bopomofo (Q198269) and Katakana (Q2493938) are uncommon and should be broken down into more primitive components, eg. ⿱𠃍一 instead of (Katakana) and ⿹𠃌丨 instead of (Bopomofo). KevinUp (talk) 03:40, 22 July 2018 (UTC)[reply]
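
Sequences such as ⿱𠃍一 and ⿹𠃌丨 are written in prefix order, which is also the order the series ordinals follow. As a rough illustration of how such a flat sequence maps to a composition tree, here is a small Python parser. It assumes only the twelve description characters U+2FF0 to U+2FFB that existed as of Unicode 11, and the function name and tree shape are choices made for this sketch.

```python
from typing import List, Tuple, Union

# Operand counts for the ideographic description characters U+2FF0..U+2FFB
# (the set in Unicode 11): U+2FF2 and U+2FF3 take three operands, the rest two.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY[chr(0x2FF2)] = 3
ARITY[chr(0x2FF3)] = 3

Tree = Union[str, Tuple[str, List["Tree"]]]

def parse_ids(sequence: str) -> Tree:
    """Parse a prefix-notation IDS into a nested (operator, operands) tree."""
    def parse(pos: int) -> Tuple[Tree, int]:
        if pos >= len(sequence):
            raise ValueError("sequence ended before all operands were read")
        symbol = sequence[pos]
        pos += 1
        if symbol not in ARITY:
            return symbol, pos          # a leaf component
        operands = []
        for _ in range(ARITY[symbol]):
            operand, pos = parse(pos)   # each operand may itself be nested
            operands.append(operand)
        return (symbol, operands), pos

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete sequence")
    return tree
```

A parser like this only works on encoded characters; unencoded components, which the proposal represents as items, would need item IDs in place of single characters.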
Should we normalize 牜, 𤣩, 礻, 𥫗, 糹, 月 or ⺼(肉部), 艹, 訁, 釒,飠 , 孑, ⺶, etc? --Okkn (talk) 05:15, 22 July 2018 (UTC)[reply]
Yes, I think we should normalize these characters, which are part of CJK Radicals Supplement (Q2493859). KevinUp (talk) 07:12, 22 July 2018 (UTC)[reply]
How about characters such as ㇀, ㇘, ㇗, ㇜, ㇉, ㇌, ㇣ that are part of CJK Strokes (Q2493860)? The IDS for (Q3594919) is ⿻乚 rather than ⿻一乚 KevinUp (talk) 07:12, 22 July 2018 (UTC)[reply]
I think not all of them are parts of CJK Radicals Supplement (Q2493859). With regard to CJK Strokes (Q2493860), if using them is common, they can be allowed. --Okkn (talk) 09:06, 23 July 2018 (UTC)[reply]
Sorry, my mistake. 5 characters from CJK Radicals Supplement (Q2493859) which are , , , , seem to be widely used in IDS, as equivalent forms in CJK unified ideograph (Q796156) are not yet encoded. Most characters in CJK Radicals Supplement (Q2493859) can be represented by instances of CJK unified ideograph (Q796156), eg. 𥫗 (U+25AD7) instead of (U+2EAE). On the other hand, in example 1 is not suitable for IDS as it can only be applied to Ming typefaces used in Japan, Korea and mainland China. (See alternative forms of "雨" for a more detailed explanation). Perhaps it will be better to create new items to represent , , , , , rather than allowing CJK Radicals Supplement (Q2493859) to be used. (Extra note: , can be broken down to smaller components). KevinUp (talk) 12:10, 23 July 2018 (UTC)[reply]
Update: CJK Compatibility Ideographs (Q2493848) such as 凞 (U+FA15) (Q55865594) are now separated from their normalized form, eg. (Q55691246) so usage of the proposed property should be less complicated. Examples (Q54875038) and (Q54874870) given by User:Okkn have been added to demonstrate the use of this property. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
@GZWDer: @Okkn: Refinements have been made to the proposal above. Please check to see if there are any improvements that can be made. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
There is just one final peculiarity: The top component of (Q3595029), which is (Q3595028), is written slightly differently in Taiwan (Q865)/Hong Kong (Q8646) ([12]) compared with Japan (Q17)/mainland China (Q19188) ([13]). Although only two examples are shown using this property, there are actually three different forms for this glyph. The third example is not shown as (Q3595028) can be further decomposed based on its jurisdiction. (See wikt: for more information) KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
Perhaps an additional qualifier might be needed to indicate that (Q3595028) used in example 6 can be further decomposed based on the jurisdiction where the glyph is used. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
Also, (Q55917855) is unsuitable to represent the bottom component of (Q3595029) as it can only be applied to the jurisdiction of mainland China (Q19188) and Vietnam (Q881) [14]. Other regions such as Japan (Q17), South Korea (Q884), Taiwan (Q865) and Hong Kong (Q8646) use a slightly different form for (Q55917855) [15] that does not represent the bottom component of (Q3595029). To show the two different forms used as the bottom component of (Q3595029), two unencoded Chinese character (Q11093293): ① (Q55917460) (which is identical to the form of (Q55917855) used in mainland China (Q19188)/Vietnam (Q881)) and ② CJK Radical Snout Two (⺕) (Q55917365) [16] (used as the bottom component of (Q3595029) in South Korea (Q884)/Taiwan (Q865)/Hong Kong (Q8646)) have been created. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
Thank you for preparing the items. As for the top component of (Q3595029), at this time I think an additional qualifier is not needed, because the glyph of (Q3595028) varies by region, and this variation corresponds to that of the top part of (Q3595029). With respect to the bottom component of (Q3595029), it seems good to use (Q55917460) and CJK Radical Snout Two (⺕) (Q55917365). Can we say that a variant character or a normalized character of (Q55917460) and CJK Radical Snout Two (⺕) (Q55917365) is (Q55917855)? --Okkn (talk) 06:38, 4 August 2018 (UTC)[reply]
From an etymological perspective, the bottom component of (Q3595029) is a reduction of the phonetic element (Q55958424) [17] which is an ideogram (Q138619) of a hand holding a broom while (Q55917855) is a pictogram (Q52827) of a pig's head, so (Q55917460)/CJK Radical Snout Two (⺕) (Q55917365) ("hand") and (Q55917855) ("pig's head") are not exactly related. KevinUp (talk) 13:36, 4 August 2018 (UTC)[reply]

@KevinUp, GZWDer, Okkn: ✓ Done: ideographic description sequence (P5753) Pintoch (talk) 17:33, 31 August 2018 (UTC)[reply]

Thanks! Now we can start using this new property. KevinUp (talk) 17:54, 31 August 2018 (UTC)[reply]