L2/15-159


Title: Plan for Obsoleting StandardizedVariants.html in Unicode 9.0
Author: Ken Whistler
Date: May 18, 2015
Action: For consideration by the UTC


Background

The file StandardizedVariants.html has formally been a part of the
Unicode Character Database for many years now (since 2002). Its original
function was to provide a visual reference for the glyph variants defined
for the various math symbols added in Unicode 3.2, along with the
various Mongolian glyph variants associated with the use of the
Mongolian variation selectors.

Starting with Unicode 4.0, the normative function of enumerating the
list of standardized variation sequences was passed on to a new
text-only data file, StandardizedVariants.txt. StandardizedVariants.html
was retained, but it became an adjunct reference, derived from the
definitive listing in StandardizedVariants.txt.

The situation for these two files has continued essentially unchanged
since 2003. However, there have been some recent developments whose
net effect has been to turn StandardizedVariants.html into a mostly
obsolete file with little real value. Furthermore, it has become
significantly harder to maintain correctly, for a number of reasons.

The first of those developments is the addition of functionality to
the Unibook tool used to format Unicode code charts. That tool now
has built-in mechanisms which parse StandardizedVariants.txt and
which then can format and display the variants for some types of
standardized variation sequences directly in the context of the
code charts.
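
As a rough illustration of what parsing StandardizedVariants.txt
involves, here is a minimal Python sketch. It is not the Unibook
implementation. It assumes semicolon-separated fields (sequence,
description, optional shaping environments) followed by a "#" comment.
The names VariationSequence and parse_line are hypothetical, and the
sample lines are illustrative rather than copied from any particular
release of the file.

```python
from typing import NamedTuple


class VariationSequence(NamedTuple):
    base: int          # code point of the base character
    selector: int      # code point of the variation selector
    description: str   # e.g. "text style"
    contexts: tuple    # shaping environments; empty if none are listed


def parse_line(line):
    """Parse one data line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    base_hex, selector_hex = fields[0].split()
    description = fields[1] if len(fields) > 1 else ""
    contexts = tuple(fields[2].split()) if len(fields) > 2 else ()
    return VariationSequence(int(base_hex, 16), int(selector_hex, 16),
                             description, contexts)


# Illustrative input in the assumed format.
SAMPLE = """\
# Sample lines in the assumed format
2268 FE00; with vertical stroke; # LESS-THAN BUT NOT EQUAL TO
1820 180B; second form; isolate medial final # MONGOLIAN LETTER A
2747 FE0E; text style; # SPARKLE
"""

sequences = [s for s in map(parse_line, SAMPLE.splitlines()) if s is not None]
```

Each parsed entry carries the base, the selector, and the description,
which is the minimum a chart formatter would need in order to pair a
sequence with its variant glyph.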

The second of those developments is the relatively recent addition
of very large numbers of standardized variation sequences, some
for CJK compatibility ideographs and some for emoji. Both of those
sets of additions have posed new challenges for maintenance of
StandardizedVariants.html.

The maintenance of StandardizedVariants.html has also created a
significant burden, because it depends on a series of hacks to
display the correct glyphs. In particular, it is stuck in a
maintenance catch-22 with the refglyph database, which has not
officially been updated since Unicode 3.2. That means that the
glyphs it uses for display have in some cases gotten out of synch
with those used in the current code charts, and updating them
requires arbitrary, undocumented changes to the refglyph database
contents. That also implies version instability for display of
the StandardizedVariants.html page.


Proposal

The proposal here is to simply obsolete StandardizedVariants.html
as of the Unicode 9.0 release.


Context

To plan out a strategy for obsoleting StandardizedVariants.html, it
is first necessary to spell out the current context for the
enumeration and exemplification of variation sequences for the
Unicode Standard. We need to keep in mind that there are actually
several different types, with different requirements for
documentation.

1. Ideographic Variation Sequences

These actually comprise the majority of defined variation sequences
for Unicode. They do not formally constitute *standardized*
variation sequences, because they are not listed in
StandardizedVariants.txt. Instead, they are defined by a registration
process, in the Ideographic Variation Database. See:

http://www.unicode.org/ivd/

The IVS always take a unified ideograph as base. They only involve
variation selectors from the Plane 14 range (U+E0100..U+E01EF).
They are created by a registration process (see UTS #37).
They are not listed in StandardizedVariants.txt, have never
been displayed in StandardizedVariants.html, and are never
displayed in the code charts.

Because of these distinctions, the IVS have no direct bearing on
the issues related to the display of standardized variation
sequences or the obsoleting of StandardizedVariants.html.

2. Standardized Variation Sequences

These come in four significant flavors.

2a. "Normal" Standardized Variation Sequences

These are the original set of mathematical symbol variants, plus a
small collection of additions over the years to deal with a few
script edge cases.

These never take a unified ideograph as base. They only involve
variation selectors VS1..VS14 (U+FE00..U+FE0D). Since Unicode 4.0
they have always been listed exhaustively in StandardizedVariants.txt
and the variant glyphs have been displayed in StandardizedVariants.html.

As of Unicode 7.0, the sequences and variant glyphs for these are
also displayed in the Unicode code charts. See, e.g.:

http://www.unicode.org/charts/PDF/U2200.pdf

2b. Mongolian Standardized Variation Sequences

These are the sequences defined for the complete rendering of Mongolian
in the current model.

These only take a Mongolian letter as base. They only involve
Mongolian free variation selectors FVS1..FVS3 (U+180B..U+180D).
Since Unicode 4.0 they have always been listed exhaustively in
StandardizedVariants.txt and the variant glyphs have been
displayed in StandardizedVariants.html.

As of Unicode 7.0, the sequences and variant glyphs for these
Mongolian sequences are also displayed in the Unicode code charts. See:

http://www.unicode.org/charts/PDF/U1800.pdf

2c. Emoji Standardized Variation Sequences

These are the set of standardized variation sequences added starting
in Unicode 6.1.0 to account for text style versus emoji style
defaults for a number of emoji characters.

These never take a unified ideograph as base. They only involve
variation selectors VS15..VS16 (U+FE0E..U+FE0F). Since Unicode 6.1 
they have always been listed exhaustively in StandardizedVariants.txt
and the variant glyphs have been displayed in StandardizedVariants.html.

The emoji variation sequences and variant glyphs have *never*
been displayed in the code charts, because of the font technology
limitations of the relevant tooling.

2d. CJK Compatibility Standardized Variation Sequences

These are the set of standardized variation sequences added as of
Unicode 6.3.0 to cover variations relevant to CJK compatibility
ideograph rendering support.

These always take a unified ideograph as base. They only involve
variation selectors VS1..VS3 (U+FE00..U+FE02). Since Unicode 6.3
they have always been listed exhaustively in StandardizedVariants.txt.
However, they are *not* displayed in StandardizedVariants.html.

As of Unicode 7.0, the standardized variation sequences associated
with the CJK compatibility ideographs are also not displayed in
the code charts. The glyphs, however, *are* displayed in the
code charts, because the point of the addition of these
variation sequences was to specify sequences to be interpreted
as those specific glyphs.
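
The distinctions among these classes can be summarized as a small
classifier. The following Python sketch is only an approximation of
the rules described above: it tests unified ideographs and Mongolian
bases by block range (an assumption; a real implementation would
consult the relevant UCD properties), and it checks only the shape of
a sequence, not whether that sequence is actually registered or listed.
The function names are hypothetical.

```python
# Unified ideograph blocks, approximated by block range (assumption).
UNIFIED_IDEOGRAPH_BLOCKS = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2B820, 0x2CEAF),  # Extension E
]
MONGOLIAN_BLOCK = (0x1800, 0x18AF)  # block-level stand-in for "Mongolian letter"


def is_unified_ideograph(cp):
    return any(lo <= cp <= hi for lo, hi in UNIFIED_IDEOGRAPH_BLOCKS)


def classify(base, selector):
    """Return "1", "2a", "2b", "2c", "2d", or None if no class fits."""
    if 0xE0100 <= selector <= 0xE01EF:          # Plane 14 selectors
        return "1" if is_unified_ideograph(base) else None
    if 0x180B <= selector <= 0x180D:            # FVS1..FVS3
        in_mongolian = MONGOLIAN_BLOCK[0] <= base <= MONGOLIAN_BLOCK[1]
        return "2b" if in_mongolian else None
    if 0xFE0E <= selector <= 0xFE0F:            # VS15..VS16
        return None if is_unified_ideograph(base) else "2c"
    if 0xFE00 <= selector <= 0xFE0D:            # VS1..VS14
        if is_unified_ideograph(base):
            # Only VS1..VS3 are used with unified ideograph bases.
            return "2d" if selector <= 0xFE02 else None
        return "2a"
    return None
```

For example, classify(0x2268, 0xFE00) falls in class 2a, while
classify(0x4FAE, 0xFE00) falls in class 2d because the base is a
unified ideograph.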


Implications

The implications of the current context just outlined are as follows:

A. StandardizedVariants.html does not display (and never has displayed)
glyphs for either the IVS (class 1) or the CJK Compatibility sequences
(class 2d). So those CJK sequences are irrelevant to its disposition.

B. The display of the normal and Mongolian variation sequences
(class 2a and 2b) is now routinely done in the code charts, where
they are actually more useful. Therefore, any display of those
sets of variation sequences in StandardizedVariants.html is now
redundant and obsolete.

C. The *only* set of variant glyphs not otherwise accounted for is
that of the emoji variation sequences (class 2c). Therefore, to obsolete
StandardizedVariants.html itself, one needs only to find a way
to continue presenting these particular emoji variation sequences.
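
The reasoning in A through C amounts to a simple coverage check. The
Python sketch below restates the display status of each class as
described in the Context section and picks out the classes shown in
StandardizedVariants.html but not in the code charts. The table layout
and variable names are assumptions made for illustration only.

```python
# For each class: (shown in StandardizedVariants.html,
#                  variant glyphs shown in the code charts)
DISPLAY = {
    "1":  (False, False),  # IVS: documented via the IVD instead
    "2a": (True, True),    # "normal" sequences
    "2b": (True, True),    # Mongolian sequences
    "2c": (True, False),   # emoji sequences
    "2d": (False, True),   # CJK compatibility: glyphs only, not sequences
}

# Classes that would lose their only display if the HTML file went away.
needs_new_home = sorted(
    cls for cls, (in_html, in_charts) in DISPLAY.items()
    if in_html and not in_charts
)
print(needs_new_home)  # prints ['2c']
```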


Plan

The simplest way forward for Unicode 9.0 would be to remove
the redundant display of class 2a and class 2b variation sequences
in StandardizedVariants.html and to document that change.

However, I think the better way forward is just to take the plunge
and remove the need for further generation and maintenance of
StandardizedVariants.html as of Unicode 9.0. This could be accomplished
by taking the following steps.

1. Create an alternative display vehicle for the emoji variation
sequences (and *only* those sequences). There is a natural
location for such an alternative display vehicle, now that the
charts associated with UTR #51 will be located in:

http://www.unicode.org/emoji/charts/

I suggest that the file be titled: EmojiVariationSequences.html.

The content required is very simple. It need specify only the
base + VS sequences and the associated glyphs. The descriptive
names currently also listed are somewhat redundant. The only
really significant data addition should be a Unicode version
(Age) field indicating when the particular emoji standardized
variation sequence was added to the standard. A possible format
for a line of the table would be:

2747 SPARKLE  U6_1  2747 FE0E {glyph} 2747 FE0F {glyph}

Adding the Unicode version value as a column would render this
page version-independent and additive. It could be referenced
for all versions of the Unicode Standard, and would only need to be
updated when a decision is made to define new sequences of
this type.

This file could easily be generated by repurposing the script
already used to generate StandardizedVariants.html, using the
existing glyphs (or, if so desired, updated emoji glyphs from
the collection otherwise used for emoji charts).

2. Completely obsolete StandardizedVariants.html in the UCD.
To keep links stable and confusion down, I suggest that for a couple
of versions at least, we drop a rump file with the same
name in the UCD, whose content would simply be documentation explaining
where to go find the display of the variant glyphs, with URLs
pointing to the relevant locations.

3. Update the documentation to explain the cutover. This would
require hitting a few places:

3a. Add some additional explanation in Section 23.4, Variation
Selectors, in the core specification.

3b. Add some additional explanation regarding what is and is not
displayed for standardized variation sequences in the code
charts, in Section 24.1, Character Names List, in the core
specification.

3c. Update the documentation in UAX #44 regarding
StandardizedVariants.html and the replacement display vehicle.

3d. Beef up the FAQ entries about variation sequences to provide
current information about where to find all the glyph variant
displays.

3e. Add whatever is necessary to emoji/charts/index.html to
provide context and a link to the new EmojiVariationSequences.html
table as part of the resources there.

3f. Add a short section and link in UTR #51 near the point where
the emoji variation sequences are defined or discussed.
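
To illustrate the table row format suggested in step 1, and why an
Age column would make the page version-independent and additive, here
is a minimal Python sketch. It is not the existing generation script.
The names EmojiVariationRow, format_row, and rows_for_version are
hypothetical, "{glyph}" is kept as a literal placeholder, and the
second sample row is invented purely to show the version filter.

```python
from typing import NamedTuple

TEXT_VS = 0xFE0E    # VS15, text style
EMOJI_VS = 0xFE0F   # VS16, emoji style


class EmojiVariationRow(NamedTuple):
    base: int     # code point of the base character
    name: str     # character name
    age: tuple    # version in which the sequences were added, e.g. (6, 1)


def format_row(row):
    """Format one row in the layout suggested in step 1."""
    base = f"{row.base:04X}"
    age = "U" + "_".join(str(n) for n in row.age)
    return (f"{base} {row.name}  {age}  "
            f"{base} {TEXT_VS:04X} {{glyph}} {base} {EMOJI_VS:04X} {{glyph}}")


def rows_for_version(rows, version):
    """Select the rows that apply to a given version of the standard."""
    return [r for r in rows if r.age <= version]


ROWS = [
    EmojiVariationRow(0x2747, "SPARKLE", (6, 1)),
    # Hypothetical later addition, present only to exercise the filter.
    EmojiVariationRow(0x1F600, "HYPOTHETICAL ENTRY", (9, 0)),
]

for row in rows_for_version(ROWS, (8, 0)):
    print(format_row(row))
```

Because every row records its own Age, a single table can serve all
versions: a reader interested in an earlier version simply ignores
rows with a later Age value, and new sequences are appended without
disturbing existing rows.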