L2/07-172

From: Peter Constable
Date: Sat, 12 May 2007
Subject: Constraint on Ideographic Variation Selector Sequences
--------------------------------------------------------

Constraint on Ideographic Variation Selector Sequences
Peter Constable, Microsoft
2007-5-11

UTS #37 defines an "Ideographic Variation Sequence (IVS)" as follows:

"a sequence of two coded characters, the first being a character with the Unified_Ideograph property, the second being a variation selector character in the range U+E0100 to U+E01EF."

This statement is completely explicit about what sequences can or cannot be considered an IVS. UTS #37 is also clear that only an IVS can appear in the Ideographic Variation Database.

What is unclear, however, is the status of potential sequences <X, VSn> where X is a character with the Unified_Ideograph property and VSn is a variation-selector character in the range U+FE00 to U+FE0F: is this a class of sequences that will never be sanctioned for standard use, or might such sequences someday be sanctioned by inclusion in Standardized-Variants.txt?

It would be useful for this uncertainty to be clarified and made certain, either to declare such sequences possible candidates for sanctioned use or to declare them permanently restricted from sanctioned use. The concern is that implementers reading UTS #37 might get the impression that such sequences will never be sanctioned and embody that assumption in implementations, only later to find that their implementations are broken if such a sequence becomes sanctioned; or, on the other hand, that implementers complicate their implementations or make them less efficient to allow for a possibility that might never be realized.

For example, consider a process that needs to record for each ideographic character from some repertoire a pointer to a visual representation (such as a glyph) of a variant, for some set of supported variants.
One implementer might create pointer arrays of size 240, one entry for each VS from U+E0100 to U+E01EF, and thereby risk one day finding that the implementation does not support a newly-sanctioned sequence using a VS from U+FE00 to U+FE0F. Another implementer might create pointer arrays of size 256 and risk that the first 16 entries are never used -- resulting in (say) 70,000+ x 16 x 2 bytes = 2.13+ MB of permanently wasted storage or working set.

This issue has presented itself in the context of considering possible specifications for supporting Variation Selector sequences in font files. Though such specifications are still under investigation, this issue pertains to real architectural decisions and is not merely hypothetical. It would therefore be helpful for UTC to resolve this open issue as soon as possible.

Given that there might never be requests for variation-selector sequences involving ideographs and variation-selector characters from U+FE00 to U+FE0F (even if such sequences were deemed permissible), and given that requests for variation-selector sequences involving ideographs can always be processed using the process defined in UTS #37, regardless of the source, it seems that there is no real need for sequences for ideographs using U+FE00 to U+FE0F and, in turn, that there is a positive impact on memory and storage to be gained if such sequences can be confidently ignored.

Therefore, I propose that UTC adopt a policy stating that variation-selector sequences for characters with the Unified_Ideograph property can only ever be sanctioned under the terms of UTS #37, and, in particular, using variation-selector characters from the range U+E0100 to U+E01EF. This policy should be stated within the standard (probably in Ch. 15, in the section on Variation Selectors), and in UTS #37.
--------------------------------------------------------
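
The trade-off in the pointer-array example above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names, the slot layout, and the 2-byte entry size are hypothetical choices that mirror the arithmetic in the example, and are not taken from any font specification.

```python
# Illustrative sketch only; all names and the table layout are assumptions.
IVS_FIRST, IVS_LAST = 0xE0100, 0xE01EF  # selectors permitted in an IVS by UTS #37
SVS_FIRST, SVS_LAST = 0xFE00, 0xFE0F    # selectors whose status for ideographs is unclear

IVS_COUNT = IVS_LAST - IVS_FIRST + 1    # 240 entries
SVS_COUNT = SVS_LAST - SVS_FIRST + 1    # 16 entries


def slot_ivs_only(vs):
    """Index into a 240-entry table, or None if the selector is unsupported."""
    if IVS_FIRST <= vs <= IVS_LAST:
        return vs - IVS_FIRST
    return None


def slot_all_selectors(vs):
    """Index into a 256-entry table.

    Slots 0-15 are reserved for U+FE00..U+FE0F and slots 16-255
    hold U+E0100..U+E01EF.
    """
    if SVS_FIRST <= vs <= SVS_LAST:
        return vs - SVS_FIRST
    if IVS_FIRST <= vs <= IVS_LAST:
        return SVS_COUNT + (vs - IVS_FIRST)
    return None


def unused_bytes(ideograph_count, entry_bytes=2):
    """Bytes reserved for the 16 extra slots if they are never used."""
    return ideograph_count * SVS_COUNT * entry_bytes
```

With these assumptions, slot_ivs_only(0xFE00) returns None (the first implementer's risk), while unused_bytes(70000) gives 2,240,000 bytes, which is the 2.13+ MB figure cited above when a megabyte is counted as 2^20 bytes (the second implementer's cost).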