L2/11-449

Title: Extensions to Ideographic Description Sequences
Date: November 23, 2011
Source: Ken Whistler
Action: For consideration by the UTC

Background

Andrew West noted a problem in the data files for UTR #45 which has potential implications for the specification of Ideographic Description Sequences (IDS) in the Unicode Standard. From his email:

    The Unicode Standard 12.2 states that "No [IDS] sequence can be longer than 16 Unicode code points in length", and "[a] sequence of characters that includes Ideographic Description characters but does not conform to the grammar and length constraints described here is not an Ideographic Description Sequence".

    However the character biáng (see http://en.wikipedia.org/wiki/Bi%C3%A1ngbi%C3%A1ng_noodles) which is now a candidate for encoding (UTC-00791) requires an IDS that is greater than 16 characters. The IDS for this character at http://www.unicode.org/reports/tr45/tr45-sourcedata-4.txt is 18 characters in length, but omits a heart component (I know that Ken Lunde is aware of this): ⿺辶⿱穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂.

    My analysis gets the IDS down to 17 characters in length (⿺辶⿳穴⿲月⿱⿲玄言玄⿲長馬長刂心), but either way it breaks the 16 character limit, and is thus not an Ideographic Description Sequence according to the Unicode Standard. Yet the IDS for a character is required for proposing it for encoding, and the illegal IDS sequence is published on the Unicode website, which is a bit awkward as the Unicode Consortium should not break its own rules.

I do think this is a problem, but I think the easiest way to handle this in the standard would be to leave the current definition of the Ideographic Description Sequence intact, but add a definition for Extended IDS which relaxes the numerical constraints. In particular I suggest:

1. In Section 12.2 of TUS add a paragraph which defines "Extended Ideographic Description Sequences" as sequences which obey the syntax rules for IDS, but which exceed one or more of the numerical limits (total length > 16, back-scan length > 6). Qualify the discussion of E-IDS with the caveat that nobody should expect implementations which interpret IDS in some systematic way (e.g., attempting to display constructed glyphs for them, or otherwise doing automated processing on them) to similarly handle E-IDS, because they exceed reasonable numerical limits for implementations. Then admit that there may be contexts in which the use of an E-IDS as a *manually* interpretable description of particularly bizarre ideographs might be helpful, e.g. in IRG discussions of bizarre ideographs under consideration for encoding. Add biáng as an example of an E-IDS to the figure(s) in Section 12.2.

2. In UTR #45, Section 2, change the description of field 6 from:

    An ideographic description sequence (IDS) for the ideograph, if one can be generated.

   to

    An ideographic description sequence (IDS) or extended ideographic description sequence (E-IDS) for the ideograph, if one can be generated.

I think these changes would be sufficient to address the problem.

My suggestion to call these "Extended Ideographic Description Sequences" (E-IDS ~ EIDS) might not be the best terminology for this, if somebody can come up with better.

Note that there have also been suggestions to extend the syntax of IDS by allowing other than Unified_CJK_Ideograph and CJK_Radical as terminals in the BNF for IDS. (See Section 12.2.) This might then be applied to allow IDS description of Tangut or Jurchen or other such non-Han (but siniform) ideographic repertoires. So any decision regarding terminology for a possible extension of IDS which exceeds the numerical limits should take such proposals into account.
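
As an illustration only, and not part of the proposal, the IDS versus E-IDS distinction suggested in item 1 could be checked along the lines of the Python sketch below. The function names are hypothetical. The terminal ranges are whole-block approximations of Unified_CJK_Ideograph and CJK_Radical rather than the exact property values. Reading "back-scan length" as the longest run of consecutive non-operator characters is an assumption made for this sketch.

```python
# Sketch only: classify a string as "IDS", "E-IDS", or "invalid".
MAX_LENGTH = 16     # total code points allowed in a plain IDS
MAX_BACK_SCAN = 6   # ASSUMPTION: longest run of consecutive non-operators

# Ideographic Description Characters U+2FF0..U+2FFB.
BINARY_OPS = {0x2FF0, 0x2FF1} | set(range(0x2FF4, 0x2FFC))
TRINARY_OPS = {0x2FF2, 0x2FF3}

# ASSUMPTION: approximate terminals by whole blocks, not exact properties.
TERMINAL_RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x2F00, 0x2FDF),    # Kangxi Radicals
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2B81F),  # Extensions B through D (approximate)
]


def is_operator(cp):
    return cp in BINARY_OPS or cp in TRINARY_OPS


def is_terminal(cp):
    return any(lo <= cp <= hi for lo, hi in TERMINAL_RANGES)


def is_well_formed(text):
    """True if text is exactly one sequence under the prefix-notation grammar."""
    if not text:
        return False
    needed = 1  # operands still expected
    for ch in text:
        if needed == 0:
            return False  # extra characters after a complete sequence
        cp = ord(ch)
        if cp in BINARY_OPS:
            needed += 1   # fills one slot, opens two
        elif cp in TRINARY_OPS:
            needed += 2   # fills one slot, opens three
        elif is_terminal(cp):
            needed -= 1
        else:
            return False
    return needed == 0


def longest_terminal_run(text):
    """Length of the longest run of consecutive non-operator characters."""
    longest = current = 0
    for ch in text:
        current = 0 if is_operator(ord(ch)) else current + 1
        longest = max(longest, current)
    return longest


def classify(text):
    if not is_well_formed(text):
        return "invalid"
    if len(text) > MAX_LENGTH or longest_terminal_run(text) > MAX_BACK_SCAN:
        return "E-IDS"
    return "IDS"
```

Under these assumptions, both biáng sequences quoted in the email (18 and 17 code points) are well formed and classify as E-IDS on total length alone; their longest non-operator runs are 3 and 5 respectively, so neither exceeds the assumed back-scan limit.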