Wikidata:Property proposal/Ideographic description sequences

88889 as of Unicode 11, ~~minus indecomposable characters~~ Note: New items will need to be created for components that are not yet encoded in Unicode. Examples of such components can be found in IRG Working Set 2015 [1]

Expected completeness

eventually complete (Q21873974)

Robot and gadget jobs

No

Motivation

This is a way to represent characters, especially unencoded characters like 𱁬 (Q7676480). Note we need a qualifier for example 3. GZWDer (talk) 17:48, 16 July 2018 (UTC)[reply]

Discussion

Comment @GZWDer: Ideographic description sequences are useful for describing the glyphs of Han characters. However, I think that in Wikidata we should store this kind of data as items (with a qualifier series ordinal (P1545)) rather than strings. By doing so, we can easily generate the composition trees of Han characters. What do you think? --Okkn (talk) 18:06, 16 July 2018 (UTC)[reply]
@Okkn: Note the representation is an ordered tree, not a sequence. For example, ⿱⿳亠丷冖巾 means ⿱(⿳亠丷冖)巾.--GZWDer (talk) 18:09, 16 July 2018 (UTC)[reply]
- @GZWDer: I know that. But the sequence for "箱", for example, is "⿱𥫗相", not "⿱𥫗(⿰木目)", right? Or should both sequences be stored in this property?--Okkn (talk) 18:17, 16 July 2018 (UTC)[reply]
  - A character may have more than one ideographic description sequence, but the shortest possible Ideographic Description Sequence is preferred. Note that the Unicode Standard does not define equivalence for two Ideographic Description Sequences that are not identical. See page 424 of [2].--GZWDer (talk) 18:23, 16 July 2018 (UTC)[reply]
  - Note ideographic description sequences are prefix notation (Q214510) and brackets never appear in sequences.--GZWDer (talk) 18:25, 16 July 2018 (UTC)[reply]
    - I know IDSes. OK, I also think we should only store the shortest one. In that case, shouldn't we link from "箱" to "相" or from "相" to "木"? I want to draw a tree like Figure 1 on page 131 of [3]. --Okkn (talk) 18:33, 16 July 2018 (UTC)[reply]
    - Examples of representing IDSes as structured data are shown below. This property can be a subproperty of part of (P361). --Okkn (talk) 20:10, 16 July 2018 (UTC)[reply]
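The point made in this thread, that an IDS is prefix notation for an ordered tree and never contains brackets, can be illustrated with a small recursive parser. This is only a sketch: the names `IDC_ARITY`, `parse_ids` and `bracketed` are hypothetical, and the arity table assumes the twelve ideographic description characters U+2FF0..U+2FFB (the set available as of Unicode 11), of which ⿲ and ⿳ take three operands and the rest take two.

```python
# Arity of each ideographic description character (IDC), U+2FF0..U+2FFB.
# Assumption: only the twelve IDCs available as of Unicode 11 are handled.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # ⿲ left / middle / right
IDC_ARITY["\u2FF3"] = 3  # ⿳ top / middle / bottom


def parse_ids(ids):
    """Parse a prefix-notation IDS string into a nested tuple (operator, child, ...)."""

    def parse_at(pos):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were read")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a leaf component
        children = []
        for _ in range(IDC_ARITY[ch]):
            child, pos = parse_at(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = parse_at(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


def bracketed(tree):
    """Render a parsed tree with explicit brackets, for human inspection only."""
    if isinstance(tree, str):
        return tree
    op, *children = tree
    return op + "".join(
        c if isinstance(c, str) else "(" + bracketed(c) + ")" for c in children
    )


print(bracketed(parse_ids("⿱⿳亠丷冖巾")))
```

The brackets produced by `bracketed` are a reading aid and are not part of any valid IDS.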

For 箱 (Q54875038): ① ⿱ (Q55589919) (series ordinal (P1545): 1) ② 𥫗 (Q55885207) (series ordinal (P1545): 2) ③ 相 (Q54874870) (series ordinal (P1545): 3)

For 相 (Q54874870): ① ⿰ (Q55589918) (series ordinal (P1545): 1) ② 木 (Q3594983) (series ordinal (P1545): 2) ③ 目 (Q54552546) (series ordinal (P1545): 3)
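To show how the two examples above could yield a composition tree, here is a minimal sketch that orders values by their series ordinal (P1545) and then expands each component that has its own decomposition. The dictionary `STATEMENTS` and the functions `ordered_values`, `shortest_ids` and `expanded_ids` are hypothetical; a plain dict stands in for Wikidata items, and labels are used instead of Q-ids.

```python
# Values and their series ordinals, copied from the two examples above.
# The entries for 箱 are deliberately listed out of order to show that
# the qualifier, not the storage order, determines the sequence.
STATEMENTS = {
    "箱": [("相", 3), ("⿱", 1), ("𥫗", 2)],
    "相": [("⿰", 1), ("木", 2), ("目", 3)],
}


def ordered_values(char):
    """Return the IDS parts of `char`, sorted by series ordinal."""
    return [value for value, _ in sorted(STATEMENTS[char], key=lambda s: s[1])]


def shortest_ids(char):
    """The stored (shortest) IDS, e.g. ⿱𥫗相 for 箱."""
    return "".join(ordered_values(char))


def expanded_ids(char):
    """Recursively replace components that have their own stored IDS."""
    if char not in STATEMENTS:
        return char  # an operator or an undecomposed component
    return "".join(expanded_ids(part) for part in ordered_values(char))


print(shortest_ids("箱"))
print(expanded_ids("箱"))
```

Only the shortest sequence is stored per item; the fully expanded form is derived by following the links from 箱 to 相.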

Then we will create a lot of items about Han character components. (Yes, [4] includes unencoded components.) How to deal with things like ⿰氵每 (China, Hong Kong, Taiwan, South Korea, Vietnam); ⿰氵毎 (Japan)?--GZWDer (talk) 21:57, 16 July 2018 (UTC)[reply]

If we import all Han characters encoded in Unicode 11, the number of components is much lower than that. I think that it's not particularly a problem. Also, in some environments, even the encoded components can't be displayed correctly. So using items brings considerable benefits to everyone.

Variations can be represented as follows:

IDSes: ① ⿰ (series ordinal (P1545): 1) ② 氵 (series ordinal (P1545): 2) ③ 每 (series ordinal (P1545): 3; writing system (P282): traditional Chinese characters, simplified Chinese characters, kyūjitai, Hanja, etc.) ④ [value not shown] (series ordinal (P1545): 3)

I'm not sure whether the common parts (⿰ and 氵) should be duplicated or not. --Okkn (talk) 05:55, 17 July 2018 (UTC)[reply]

The common parts should be listed again to make things clearer. An additional qualifier applies to jurisdiction (P1001) or writing system (P282) will be used to differentiate between the two sets of data. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
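The suggestion above, repeating the common parts and separating the two sets with a qualifier, can be sketched as follows. The tuple layout, the name `ids_by_jurisdiction` and the use of plain region names instead of items are assumptions for illustration; the two sequences are the ones quoted earlier in this discussion (⿰氵每 and ⿰氵毎).

```python
from collections import defaultdict

# Regions taken from the question earlier in this discussion.
NON_JAPAN = ("China", "Hong Kong", "Taiwan", "South Korea", "Vietnam")
JAPAN = ("Japan",)

# (value, series ordinal, jurisdictions); the common parts ⿰ and 氵 are
# listed once per set, as proposed above.
STATEMENTS = [
    ("⿰", 1, NON_JAPAN),
    ("氵", 2, NON_JAPAN),
    ("每", 3, NON_JAPAN),
    ("⿰", 1, JAPAN),
    ("氵", 2, JAPAN),
    ("毎", 3, JAPAN),
]


def ids_by_jurisdiction(statements):
    """Group statements by jurisdiction and join each group in ordinal order."""
    grouped = defaultdict(list)
    for value, ordinal, jurisdictions in statements:
        for jurisdiction in jurisdictions:
            grouped[jurisdiction].append((ordinal, value))
    return {
        jurisdiction: "".join(value for _, value in sorted(parts))
        for jurisdiction, parts in grouped.items()
    }


for region, ids in ids_by_jurisdiction(STATEMENTS).items():
    print(region, ids)
```

Duplicating the common parts keeps each set self-contained, so a reader of one jurisdiction's values never has to infer which unqualified statements belong to it.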

@GZWDer: Could you please change the datatype? Then I will support it. --Okkn (talk) 17:30, 19 July 2018 (UTC)[reply]
Changed. GZWDer (talk) 17:35, 19 July 2018 (UTC)[reply]
- Support Thanks! --Okkn (talk) 17:40, 19 July 2018 (UTC)[reply]
~~Oppose~~ (changed to Support) There are many mistakes in external lists such as [5]. This information is readily available on English Wiktionary but many errors exist due to mass copying without proper checking. The use of ideographic description character (Q55589899) is more appropriate for characters that are not yet encoded in Unicode. As for characters that have already been encoded, character forms or glyph shapes are better depicted in graphical form, eg. as images or by referring to GlyphWiki ID (P5467). Also, not all characters can be represented by the proposed property, eg. 慶 (inner glyph cannot be decomposed). In addition, there are many inconsistencies in Unicode encoding. Take a look at the derived characters of 兹 as an example. Also, in some cases this data is better described in textual form rather than in the form of structured data to be understood by humans. See 凞 as an example. KevinUp (talk)
- I'm aware that this data is especially useful for generating composition trees of characters but I regret to say that there are too many inconsistencies in Unicode, especially for archaic and rarely used characters. This sort of data needs to be hand sorted to be meaningful and free from error. Information on composition trees can be partially obtained from English Wiktionary by referring to the section on "derived characters". See 𦭝 as an example. KevinUp (talk) 14:51, 20 July 2018 (UTC)[reply]
  - Update: I found this: [6] which can be used to query derived characters. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]
- Also, the selection of which glyph to be used to represent a character can be highly subjective due to ambiguous forms. See 頋 (formed by minor variations between 㔾/巳) as an example. Also, note that we are unable to create separate items for CJK Compatibility Ideographs (Q2493848). Perhaps you could provide an example based on 殺 to illustrate the use of this property. KevinUp (talk) 14:51, 20 July 2018 (UTC)[reply]
  - @KevinUp: I can understand your concern, but without this property, we cannot link characters that can be clearly represented by this property (ex. 相 → 木 and 目). We are not obliged to decompose every Han character in Wikidata. A property is just a medium for the expression, and I think linking items is the most important thing in graph databases. Even if it contains mistakes or inconsistent values, it is more useful than no links. In addition, it may be easier to find errors in linked data than in unstructured text. --Okkn (talk) 15:51, 20 July 2018 (UTC)[reply]
    - @Okkn: I think I will support this proposal under two conditions: (1) No "robot and gadget jobs" involved. All values have to be entered manually. (2) ~~Impose a limitation, ie. this property can only be used for characters that have only one unique composition. This can be done by renaming the property as something like "unique IDS".~~ This way, simple characters such as 相 → 木 and 目 can be decomposed while complex Han characters that have more than one IDS are not allowed to use this property. I have added the link to [7], which has fewer errors compared to [8]. KevinUp (talk) 16:13, 20 July 2018 (UTC)[reply]
      - As for the first point, I don't plan to conduct robot jobs related to this property and I also think we should not import IDSes automatically from the sources above. (But it is technically impossible to prevent other people from doing so...) As for the second, I think we can allow multiple IDSes only when the character has CJKV variants (like the above example of "海"). --Okkn (talk) 17:28, 20 July 2018 (UTC)[reply]

@KevinUp: Note I have said "minus indecomposable characters" in the property proposal. About 凞, I have proposed two layouts (warning: collapsed by default as it is very long)

I think this property is useful even for characters that have already been encoded as it provides a structured description of character and it is easy to query for derived characters from its part.--GZWDer (talk) 20:58, 20 July 2018 (UTC)[reply]

@GZWDer: Good job. I prefer Layout 1 (consistent with Okkn's proposal) as it has better structure. One more thing to take note of when this property is created: Please apply single-value constraint (Q19474404) of "apply to code point" (uncreated property) and "applies to jurisdiction (P1001)" with allowed values confined to mainland China (Q19188), Hong Kong (Q8646), Taiwan (Q865), Japan (Q17), South Korea (Q884) and Vietnam (Q881) (corresponding to GHTJKV on the Unicode charts). The qualifier "writing system (P282)" is unsuitable for IDS as there is no clear guideline. 凞 is neither traditional nor simplified. traditional Chinese characters (Q178528) are also used by mainland China (Q19188) in the encoding of historical writings. In some cases the term standard hanzi character (Q8044489) is preferred over Simplified Chinese (Q13414913). See 敢 and its derived characters for example.

Support for now. KevinUp (talk) 05:02, 21 July 2018 (UTC)[reply]

@Okkn: In some cases, Unicode characters that are not part of CJK unified ideograph (Q796156) such as CJK Strokes (Q2493860), Geometric Shapes (Q750114), Katakana (Q2493938) and Bopomofo (Q198269) may be used in IDS for better representation. See 㔔, 骨, 厁, 雪 for example. Should we include these into the allowed values? Note that Geometric Shapes (Q750114) may be used in some Hanja (Q485619) characters such as 㐃 KevinUp (talk) 06:39, 21 July 2018 (UTC)[reply]

I wonder whether there are cases where we need to separate shinjitai and kyujitai on IDSes. But using only applies to jurisdiction (P1001) as a separator may be reasonable.

From the point of view of Japanese, it is very strange that Katakana (Q2493938) (ヨ) can be the parts of Han characters. I prefer creating new “Chinese character component” items to using extraneous symbols. --Okkn (talk) 13:17, 21 July 2018 (UTC)[reply]

For shinjitai and kyujitai, these characters have been encoded with separate code points in Unicode (some as part of CJK Compatibility Ideographs (Q2493848)), so after the new property "apply to code point" (as suggested in Layout 1) has been created, we can use this property as a second single-value constraint (Q19474404). As for using Katakana (Q2493938) as part of IDS, this is probably only a temporary measure, as quite a few Katakana (Q2493938) have been encoded as part of CJK unified ideograph (Q796156). See 𫡏 (ケ), 𠂇 (ナ), 㐅 (メ) for example. Currently, 匚 and 𬼖 exist as part of CJK unified ideograph (Q796156), but their mirror images コ and ヨ are not encoded yet, so using Katakana in IDS is just a temporary measure. For example, we can use コ in 囙. KevinUp (talk) 14:22, 21 July 2018 (UTC)[reply]

In Wikidata, we are not limited to encoded characters; we can also use items that represent unencoded parts. Is it very common to use katakana as a part of IDSes? In fact, the shape of katakana in Ming (Q1071487) fonts in Japan is really different from that of Han characters (see http://en.glyphwiki.org/wiki/u30e8 and http://en.glyphwiki.org/wiki/koseki-112010). --Okkn (talk) 15:36, 21 July 2018 (UTC)[reply]

Yes, it will be better to create new items to represent unencoded parts rather than using alternative shortcuts such as Katakana (Q2493938). No, it is not common to use katakana as part of IDSes. I just realized that the typographic style used for katakana in Ming (Q1071487) fonts in Japan has a closer resemblance to 宋朝体 (commonly used in educational materials) than to 明朝体. Because most computer systems use ゴシック体/丸ゴシック体 rather than 明朝体 on web browsers, I had assumed that the typography of katakana would be identical to the typography used by Han characters in 明朝体. I now realize that they are different. Thanks for pointing that out to me. On the other hand, usage of CJK Strokes (Q2493860) in IDSes is much more common, as some components such as ㇉ are not part of CJK unified ideograph (Q796156) (Note that ㇉ has no meaning on its own). KevinUp (talk) 16:47, 21 July 2018 (UTC)[reply]

By the way, a list of commonly found components that are not yet encoded, eg. the left component of 段 (Q54912005) can be found at the beginning of [9]. KevinUp (talk) 03:40, 22 July 2018 (UTC)[reply]

@GZWDer: What do you think? Would it be appropriate to create new items to represent unencoded parts or shall we use alternatives such as CJK Strokes (Q2493860), ~~Geometric Shapes (Q750114)~~ and ~~Bopomofo (Q198269)~~ for a better representation of Han characters? KevinUp (talk) 16:47, 21 July 2018 (UTC)[reply]

Upon further examination I noticed that ⻗ used in example 1 is actually part of CJK Radicals Supplement (Q2493859) and is commonly used with CJK Strokes (Q2493860) and CJK unified ideograph (Q796156) in external IDSes such as [10] and [11]. However, ~~Geometric Shapes (Q750114)~~, ~~Bopomofo (Q198269)~~ and ~~Katakana (Q2493938)~~ are uncommon and should be broken down into more primitive components, eg. ⿱𠃍一 instead of コ (Katakana) and ⿹𠃌丨 instead of ㄗ (Bopomofo). KevinUp (talk) 03:40, 22 July 2018 (UTC)[reply]

Should we normalize 牜, 𤣩, 礻, 𥫗, 糹, 月 or ⺼(肉部), 艹, 訁, 釒, 飠, 孑, ⺶, etc? --Okkn (talk) 05:15, 22 July 2018 (UTC)[reply]

Yes, I think we should normalize these characters, which are part of CJK Radicals Supplement (Q2493859). KevinUp (talk) 07:12, 22 July 2018 (UTC)[reply]

How about characters such as ㇀, ㇘, ㇗, ㇜, ㇉, ㇌, ㇣ that are part of CJK Strokes (Q2493860)? The IDS for 七 (Q3594919) is ⿻㇀乚 rather than ⿻一乚 KevinUp (talk) 07:12, 22 July 2018 (UTC)[reply]

I think not all of them are parts of CJK Radicals Supplement (Q2493859). With regard to CJK Strokes (Q2493860), if using them is common, they can be allowed. --Okkn (talk) 09:06, 23 July 2018 (UTC)[reply]

Sorry, my mistake. 5 characters from CJK Radicals Supplement (Q2493859) which are ⺼, ⺶, ⺌, ⺗, ⺀ seem to be widely used in IDS, as equivalent forms in CJK unified ideograph (Q796156) are not yet encoded. Most characters in CJK Radicals Supplement (Q2493859) can be represented by instances of CJK unified ideograph (Q796156), eg. 𥫗 (U+25AD7) instead of ⺮ (U+2EAE). On the other hand, ⻗ in example 1 is not suitable for IDS as it can only be applied to Ming typefaces used in Japan, Korea and mainland China. (See alternative forms of "雨" for a more detailed explanation). Perhaps it will be better to create new items to represent ⺼, ⺶, ⺌, ⺗, ⺀, rather than allowing CJK Radicals Supplement (Q2493859) to be used. (Extra note: ⺗, ⺀ can be broken down to smaller components). KevinUp (talk) 12:10, 23 July 2018 (UTC)[reply]

Update: CJK Compatibility Ideographs (Q2493848) such as 凞 (U+FA15) (Q55865594) are now separated from their normalized form, eg. 凞 (Q55691246) so usage of the proposed property should be less complicated. Examples 箱 (Q54875038) and 相 (Q54874870) given by User:Okkn have been added to demonstrate the use of this property. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]

@GZWDer: @Okkn: Refinements have been made to the proposal above. Please check to see if there are any improvements that can be made. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]

There is just one final peculiarity: The top component of 雪 (Q3595029), which is 雨 (Q3595028), is written slightly differently in Taiwan (Q865)/Hong Kong (Q8646) ([12]) compared with Japan (Q17)/mainland China (Q19188) ([13]). Although only two examples are shown using this property, there are actually three different forms for this glyph. The third example is not shown as 雨 (Q3595028) can be further decomposed based on its jurisdiction. (See wikt:雨 for more information) KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]

Perhaps an additional qualifier might be needed to indicate that 雨 (Q3595028) used in example 6 can be further decomposed based on the jurisdiction where the glyph is used. KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]

Also, 彐 (Q55917855) is unsuitable to represent the bottom component of 雪 (Q3595029) as it can only be applied to the jurisdiction of mainland China (Q19188) and Vietnam (Q881) [14]. Other regions such as Japan (Q17), South Korea (Q884), Taiwan (Q865) and Hong Kong (Q8646) use a slightly different form for 彐 (Q55917855) [15] that does not represent the bottom component of 雪 (Q3595029). To show the two different forms used as the bottom component of 雪 (Q3595029), two unencoded Chinese character (Q11093293) items have been created: ① 彐 (Q55917460) (which is identical to the form of 彐 (Q55917855) used in mainland China (Q19188)/Vietnam (Q881)) and ② CJK Radical Snout Two (⺕) (Q55917365) [16] (used as the bottom component of 雪 (Q3595029) in South Korea (Q884)/Taiwan (Q865)/Hong Kong (Q8646)). KevinUp (talk) 13:50, 3 August 2018 (UTC)[reply]

Thank you for preparing the items. As for the top component of 雪 (Q3595029), at this time I think an additional qualifier is not needed, because the glyph of 雨 (Q3595028) varies by region, and this variation corresponds to that of the top part of 雪 (Q3595029). With respect to the bottom component of 雪 (Q3595029), it seems good to use 彐 (Q55917460) and CJK Radical Snout Two (⺕) (Q55917365). Can we say that a variant character or a normalized character of 彐 (Q55917460) and CJK Radical Snout Two (⺕) (Q55917365) is 彐 (Q55917855)? --Okkn (talk) 06:38, 4 August 2018 (UTC)[reply]

From an etymological perspective, the bottom component of 雪 (Q3595029) is a reduction of the phonetic element 彗 (Q55958424) [17] which is an ideogram (Q138619) of a hand holding a broom while 彐 (Q55917855) is a pictogram (Q52827) of a pig's head, so 彐 (Q55917460)/CJK Radical Snout Two (⺕) (Q55917365) ("hand") and 彐 (Q55917855) ("pig's head") are not exactly related. KevinUp (talk) 13:36, 4 August 2018 (UTC)[reply]

@KevinUp, GZWDer, Okkn: Done: ideographic description sequence (P5753) − Pintoch (talk) 17:33, 31 August 2018 (UTC)[reply]

Thanks! Now we can start using this new property. KevinUp (talk) 17:54, 31 August 2018 (UTC)[reply]
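As an illustration of the "query for derived characters from its part" use case raised earlier, the following sketch builds a SPARQL query for the Wikidata Query Service that lists items whose ideographic description sequence (P5753) includes a given component item. The function name `derived_characters_query` is hypothetical, and the query assumes that IDS values are stored as item-valued statements as agreed in this discussion; whether a given character actually has such statements is not guaranteed.

```python
def derived_characters_query(component_qid, limit=100):
    """Build a SPARQL query listing items that use `component_qid` in P5753."""
    # Reject anything that is not a plain Q-id, so the value can be
    # interpolated into the query text safely.
    if not (component_qid.startswith("Q") and component_qid[1:].isdigit()):
        raise ValueError("expected an item id such as 'Q3594983'")
    return (
        "SELECT ?char ?charLabel WHERE {\n"
        f"  ?char wdt:P5753 wd:{component_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


# 木 (Q3594983) is the component used in the 相 example above.
print(derived_characters_query("Q3594983"))
```

The printed query can be pasted into the query service; the sketch itself makes no network request.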
