L2/15-159

Title: Plan for Obsoleting StandardizedVariants.html in Unicode 9.0
Author: Ken Whistler
Date: May 18, 2015
Action: For consideration by the UTC

Background

The file StandardizedVariants.html has formally been a part of the Unicode Character Database for many years now (since 2002). Its original function was to provide a visual reference for the glyph variants defined for the various math symbols added in Unicode 3.2, along with the various Mongolian glyph variants associated with the use of the Mongolian variation selectors.

Starting with Unicode 4.0, the normative function of enumerating the list of standardized variation sequences was passed on to a new text-only data file, StandardizedVariants.txt. StandardizedVariants.html was retained, but it became an adjunct reference, derived from the definitive listing in StandardizedVariants.txt.

The situation for these two files has continued essentially unchanged since 2003. However, there have been some recent developments whose net effect has been to turn StandardizedVariants.html into a mostly obsolete file with little real value. Furthermore, it has become significantly harder to maintain correctly, for a number of reasons.

The first of those developments is the addition of functionality to the Unibook tool used to format Unicode code charts. That tool now has built-in mechanisms which parse StandardizedVariants.txt and which then can format and display the variants for some types of standardized variation sequences directly in the context of the code charts.

The second of those developments is the relatively recent addition of very large numbers of standardized variation sequences, some for CJK compatibility ideographs and some for emoji. Both of those sets of additions have posed new challenges for maintenance of StandardizedVariants.html.

The maintenance of StandardizedVariants.html has also created a significant burden, because it depends on a series of hacks to display the correct glyphs.
In particular, it is stuck in a maintenance catch-22 with the refglyph database, which has not officially been updated since Unicode 3.2. That means that the glyphs it uses for display have in some cases gotten out of sync with those used in the current code charts, and updating them requires arbitrary, undocumented changes to the refglyph database contents. That also implies version instability for display of the StandardizedVariants.html page.

Proposal

The proposal here is to simply obsolete StandardizedVariants.html as of the Unicode 9.0 release.

Context

To plan out a strategy for obsoleting StandardizedVariants.html, it is first necessary to spell out the current context for the enumeration and exemplification of variation sequences for the Unicode Standard. We need to keep in mind that there are actually several different types, with different requirements for documentation.

1. Ideographic Variation Sequences

These actually comprise the majority of defined variation sequences for Unicode. They do not formally constitute *standardized* variation sequences, because they are not listed in StandardizedVariants.txt. Instead, they are defined by a registration process, in the Ideographic Variation Database. See:

http://www.unicode.org/ivd/

An IVS always takes a unified ideograph as base. They only involve variation selectors from the Plane 14 range (U+E0100..U+E01EF). They are created by a registration process (see UTS #37). They are not listed in StandardizedVariants.txt, have never been displayed in StandardizedVariants.html, and are never displayed in the code charts.

Because of these distinctions, the IVS have no direct bearing on the issues related to the display of standardized variation sequences or the obsoleting of StandardizedVariants.html.

2. Standardized Variation Sequences

These come in four significant flavors.

2a. "Normal" Standardized Variation Sequences

These are the original set of mathematical symbol variants, plus a small collection of additions over the years to deal with a few script edge cases.

These never take a unified ideograph as base. They only involve variation selectors VS1..VS14 (U+FE00..U+FE0D). Since Unicode 4.0 they have always been listed exhaustively in StandardizedVariants.txt and the variant glyphs have been displayed in StandardizedVariants.html.

As of Unicode 7.0, the sequences and variant glyphs for these are also displayed in the Unicode code charts. See, e.g.:

http://www.unicode.org/charts/PDF/U2200.pdf

2b. Mongolian Standardized Variation Sequences

These are the sequences defined for the complete rendering of Mongolian in the current model.

These only take a Mongolian letter as base. They only involve Mongolian free variation selectors MVS1..MVS3 (U+180B..U+180D). Since Unicode 4.0 they have always been listed exhaustively in StandardizedVariants.txt and the variant glyphs have been displayed in StandardizedVariants.html.

As of Unicode 7.0, the sequences and variant glyphs for these Mongolian sequences are also displayed in the Unicode code charts. See:

http://www.unicode.org/charts/PDF/U1800.pdf

2c. Emoji Standardized Variation Sequences

These are the set of standardized variation sequences added starting in Unicode 6.1.0 to account for text style versus emoji style defaults for a number of emoji characters.

These never take a unified ideograph as base. They only involve variation selectors VS15..VS16 (U+FE0E..U+FE0F). Since Unicode 6.1 they have always been listed exhaustively in StandardizedVariants.txt and the variant glyphs have been displayed in StandardizedVariants.html.

The emoji variation sequences and variant glyphs have *never* been displayed in the code charts, because of the font technology limitations of the relevant tooling.

2d. CJK Compatibility Standardized Variation Sequences

These are the set of standardized variation sequences added as of Unicode 6.3.0 to cover variations relevant to CJK compatibility ideograph rendering support.

These always take a unified ideograph as base. They only involve variation selectors VS1..VS3 (U+FE00..U+FE02). Since Unicode 6.3 they have always been listed exhaustively in StandardizedVariants.txt. However, they are *not* displayed in StandardizedVariants.html.

As of Unicode 7.0, the standardized variation sequences associated with the CJK compatibility ideographs are also not displayed in the code charts. The glyphs, however, *are* displayed in the code charts, because the point of the addition of these variation sequences was to specify sequences to be interpreted as those specific glyphs.

Implications

The implications of the current context just outlined are as follows:

A. StandardizedVariants.html does not display (and never has displayed) glyphs for either the IVS (class 1) or the CJK Compatibility sequences (class 2d). So those CJK sequences are irrelevant to its disposition.

B. The display of the normal and Mongolian variation sequences (class 2a and 2b) is now routinely done in the code charts, where they are actually more useful. Therefore, any display of those sets of variation sequences in StandardizedVariants.html is now redundant and obsolete.

C. The *only* set of variant glyphs not otherwise accounted for are the emoji variation sequences (class 2c). Therefore, to obsolete StandardizedVariants.html itself, one needs only to find a way to continue presenting these particular emoji variation sequences.

Plan

The simplest way forward for Unicode 9.0 would simply be to remove the redundant display of class 2a and class 2b variation sequences in StandardizedVariants.html and to document that change.
However, I think the better way forward is just to take the plunge and remove the need for further generation and maintenance of StandardizedVariants.html as of Unicode 9.0. This could be accomplished by taking the following steps.

1. Create an alternative display vehicle for the emoji variation sequences (and *only* those sequences).

There is a natural location for such an alternative display vehicle, now that the charts associated with UTR #51 will be located in:

http://www.unicode.org/emoji/charts/

I suggest that the file be titled: EmojiVariationSequences.html.

The content required is very simple. It need specify only the base + VS sequences and the associated glyphs. The descriptive names currently also listed are somewhat redundant. The only really significant data addition should be a Unicode version (Age) field indicating when the particular emoji standardized variation sequence was added to the standard. A possible format for a line of the table would be:

    2747  SPARKLE  U6_1  2747 FE0E {glyph}  2747 FE0F {glyph}

Adding the Unicode version value as a column would render this page version-independent and additive. It could be referenced for all versions of the Unicode Standard, and would only need to be updated when a decision is made to define new sequences of this type.

This file could easily be generated by a simple repurposing of the script already used to generate StandardizedVariants.html, using already existing glyphs for it (or, if so desired, updated emoji glyphs from the collection otherwise used for emoji charts).

2. Completely obsolete StandardizedVariants.html in the UCD.

To keep links stable and confusion down, I suggest that for a couple of versions at least, we simply drop a rump file with the same name in the UCD, whose content would simply be documentation explaining where to go find the display of the variant glyphs, with URLs pointing to the relevant locations.

3. Update the documentation to explain the cutover.

This would require hitting a few places:

3a. Add some additional explanation in Section 23.4, Variation Selectors, in the core specification.

3b. Add some additional explanation regarding what is and is not displayed for standardized variation sequences in the code charts, in Section 24.1, Character Names List, in the core specification.

3c. Update the documentation in UAX #44 regarding StandardizedVariants.html and the replacement display vehicle.

3d. Beef up the FAQ entries about variation sequences to provide current information about where to find all the glyph variant displays.

3e. Add whatever is necessary to emoji/charts/index.html to provide context and a link to the new EmojiVariationSequences.html table as part of the resources there.

3f. Add a short section and link in UTR #51 near the point where the emoji variation sequences are defined or discussed.
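
As a rough illustration of step 1, the following Python sketch shows how the rows of an EmojiVariationSequences.html table might be derived from data in the style of StandardizedVariants.txt. It is not the script currently used to generate StandardizedVariants.html. The input layout ("sequence; description; contexts # name"), the sample lines, the function names parse_sequences and emoji_rows, and the hand-supplied ages mapping are all assumptions made for the sake of the example; where the Age value would really come from is left open.

```python
from collections import defaultdict

# Hypothetical sample in the assumed "sequence; description; contexts # name" layout.
SAMPLE = """\
2229 FE00; with serifs; # INTERSECTION
2747 FE0E; text style; # SPARKLE
2747 FE0F; emoji style; # SPARKLE
"""

EMOJI_SELECTORS = {0xFE0E, 0xFE0F}  # VS15..VS16, the class 2c selectors


def parse_sequences(text):
    """Yield (base, selector, name) for each two-code-point data line."""
    for raw in text.splitlines():
        data, _, comment = raw.partition("#")
        fields = [f.strip() for f in data.split(";")]
        if not fields[0]:
            continue  # blank line or comment-only line
        code_points = fields[0].split()
        if len(code_points) != 2:
            continue  # not a base + variation selector pair
        base, selector = (int(cp, 16) for cp in code_points)
        yield base, selector, comment.strip()


def emoji_rows(text, ages):
    """Build one row per emoji base: base, name, age, then each sequence."""
    selectors = defaultdict(list)
    names = {}
    for base, selector, name in parse_sequences(text):
        if selector in EMOJI_SELECTORS:
            selectors[base].append(selector)
            names[base] = name
    rows = []
    for base in sorted(selectors):
        # "?" marks a base whose Age was not supplied (assumed placeholder).
        cells = [f"{base:04X}", names[base], ages.get(base, "?")]
        for selector in sorted(selectors[base]):
            cells.append(f"{base:04X} {selector:04X} {{glyph}}")
        rows.append(" ".join(cells))
    return rows


if __name__ == "__main__":
    # The Age value is supplied by hand here; its real source is not specified.
    for row in emoji_rows(SAMPLE, {0x2747: "U6_1"}):
        print(row)
```

Filtering on the selector alone is enough here, because only the class 2c sequences use VS15 and VS16; the math sequence in the sample is skipped without any check on its base character.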