L2/10-306

Subject:     Comments on the revised proposal to encode variation sequences for Latin and Cyrillic
Date:     Sun, 08 Aug 2010
From:     Asmus Freytag


Karl's proposal in L2/10-280 raises a number of interesting issues around the glyph selection problem caused by the unification process inherent in character encoding.

His analysis of these issues is focused on the status quo, but even with the advancement in specifications cited in Peter Constable's comments, the fact remains that there are some interesting gaps that could be filled with a carefully considered set of additional variation sequences. This set will most likely turn out to be quite different than the actual set proposed, but rather than dismissing document L2/10-280 I encourage the UTC to see it as a starting point for a necessary discussion on how to plug some remaining holes that even new technology won't be able to fix (after it's implemented).

There are three aspects to the proposal that I will take up in turn.

1) Documenting variation sequences

Karl is correct in identifying the weakness of the current documentation method (standardizedvariants.html). For logistical reasons, documenting variation sequences (as well as named sequences) in the code charts unfortunately does not appear feasible, as nice as it would be. However, annotations could be placed in the charts to point to the fact that variation sequences exist. I recommend that the UTC accept the intent of this part of the proposal (better, more visible documentation).

2) Integrating Variation Sequences and Named Sequences

Variation sequences came before named sequences as the Unicode Standard evolved. Now that named sequences exist, one should ask whether it makes sense to continue the practice of using a separate style of "description" for variation sequences, when they might as well be named. I recommend that the UTC seriously investigate whether these "naming" schemes can be unified.

L2/10-280 suggests that named sequences be based on normative aliases, where they exist. I believe that is a useful practice to follow even for regular named sequences, especially wherever the aliases were added to fix a spelling error in the original character name. I recommend that the UTC adopt this element of the proposal.

About the other suggestions for creating names for variation sequences in particular, I can see the utility of marking variation sequences in their name to distinguish them from combining and other character sequences, but the proposed scheme looks clumsy. It might be better to simply uppercase the existing descriptions (after removing anything that violates character name syntax).
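
As a rough illustration of that simpler approach, the following Python sketch uppercases a description and drops anything outside the character-name repertoire (A-Z, 0-9, space, hyphen-minus). The function name and the sample descriptions are assumptions made for illustration only; they are not taken from L2/10-280.

```python
import re

def description_to_name(description: str) -> str:
    """Turn a free-form description into a name-like string (hypothetical helper)."""
    upper = description.upper()
    # Drop every character that character name syntax does not allow.
    cleaned = re.sub(r"[^A-Z0-9 -]", "", upper)
    # Collapse runs of spaces left behind by removed characters.
    return re.sub(r" +", " ", cleaned).strip()

# Sample descriptions are illustrative, not an official list.
print(description_to_name("with vertical stroke"))
print(description_to_name("slanted equal (variant)"))
```

A real scheme would still need a rule for combining the result with the base character name, which this sketch leaves open.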

3) The need for additional variation sequences

Discussing this part of L2/10-280 requires a bit of background:

When Unicode was first contemplated, the web did not exist and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph.

As the proposal noted, universal fonts and viewing documents on other platforms and systems across the web have made this solution unattractive for general texts, up to now.

We are left then with these scenarios:

1) Free variation
2) Orthographic variation of isolated characters (by language, e.g. different capitals)
3) Orthographic variation of entire texts (e.g. italic Cyrillic forms, by language)
4) Notational conventions (e.g. IPA)

Free variation
--------------
For free variation of a glyph, the only possible solutions are either font selection or use of a variation sequence. I concur with Karl that in this case, where notable variations have been unified, adding variation selectors is a much more viable means of conveying authorial intent than font selection. The problem here is twofold: there is no standardized way to select a particular free variation (i.e. one that cannot be deduced from the language of the surrounding text) in the realm of higher-level protocols, other than by creating a font containing that variant.

I didn't spot an example for this case in a cursory reading of L2/10-280, but I recommend that the UTC remain open to encoding variation sequences whenever glyphs are unified where the use depends on such free variation.

Orthographic variations
-----------------------
If text is language tagged, then OpenType mechanisms exist in principle to handle scenarios 2 and 3. For full texts in a certain language, using variation selectors throughout is clearly unappealing as a solution, compared to the OpenType and CSS features documented by Peter Constable and Martin Dürst. This would appear to void the rationale for most of the variation sequences proposed in L2/10-280.

However, the use of variation selectors could be a viable solution for being able to embed correctly rendered citations in other text, given that language tagging can (and will) be separated from a document and that automatic language tagging may detect large chunks of text, but not short runs.

Defining variation selectors for these instances would be entirely analogous to the "exceptional" use of the free variation selectors for Mongolian. Context (and locale settings) would take care of the ordinary case; variation selectors would exist as a tool for situations where context and locale settings cannot be used.

I recommend that the UTC investigate (or ask the proposer to investigate) to what degree there is a need for such means to select variations that are out of context.

Notational conventions
----------------------
Many notations give semantic value to a particular glyphic appearance of a character. Such notational conventions are addressed in Unicode by duplicate encoding (IPA) or by variation sequences. The scheme has holes, in that in a few cases it is not possible to select one of the variants explicitly. Instead, the ambiguous form has to be used, in the hope that the font in use will have the proper variant in place for the ambiguous form.

For IPA, for example, the "a without a handle" is encoded at U+0251, while for the "a with handle" (or two storey a in the notation of L2/10-280) one must use U+0061, which encodes both shapes ambiguously. This contrasts with the way Unicode dealt with other ambiguous characters, such as the HYPHEN-MINUS, for which both forms, the HYPHEN and the MINUS form, exist as characters.

Adding a limited set of variation sequences (for example one that explicitly requests the "a" at U+0061 to be the two storey one needed for IPA) would fill the gap for times when controlling the precise display font is not possible.
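
To make the idea concrete, here is a minimal Python sketch of what such a sequence might look like at the code point level. The pairing of U+0061 with VARIATION SELECTOR-1 is purely hypothetical: it is not a standardized variation sequence, and the choice of selector is an assumption made for illustration.

```python
import unicodedata

VS1 = "\uFE00"  # VARIATION SELECTOR-1

# Hypothetical sequence: <U+0061, U+FE00> is NOT a standardized variation
# sequence; it only shows what an explicit "two storey a" request could look like.
two_storey_a = "a" + VS1
single_storey_a = "\u0251"  # already separately encoded for IPA

def describe(text: str) -> list:
    """List each code point in text with its character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe(two_storey_a))
print(describe(single_storey_a))

# The selector is an ordinary code point in the text stream, so it
# survives canonical normalization unchanged.
print(unicodedata.normalize("NFC", two_storey_a) == two_storey_a)
```

Because the selector travels with the text, the request would not depend on language tagging or on which font the receiving system happens to choose, although a font would still need to support the sequence for it to have any visible effect.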

A particular case where this would be highly valuable is that of the Greek characters for which variant forms are encoded as mathematical symbols (unfortunately L2/10-280 restricts itself to Latin and Cyrillic and therefore missed these cases).

In order to contrast with U+03D5 GREEK PHI SYMBOL, which is always the straight form, a font must implement U+03C6 GREEK SMALL LETTER PHI as the loopy form. Normally, however, the choice of which form of U+03C6 to use follows from the general type design, and varies systematically between serifed and sans-serifed forms.

Adding a variation selector here could plug this hole, and once implemented, could lead to a wider range of fonts supporting the necessary distinction.
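
The asymmetry between the two phi characters can be seen in the character data itself. The short Python sketch below uses only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

phi_letter = "\u03C6"  # shape left to the type design
phi_symbol = "\u03D5"  # always the straight form

print(unicodedata.name(phi_letter))
print(unicodedata.name(phi_symbol))

# The symbol carries a compatibility decomposition to the letter, so
# compatibility normalization folds it away; nothing maps in the other
# direction to request the loopy form explicitly.
print(unicodedata.decomposition(phi_symbol))
print(unicodedata.normalize("NFKC", phi_symbol) == phi_letter)
```

Only one of the two shapes can be requested explicitly today; the other is left to whatever the font provides for U+03C6.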

I recommend that the UTC endorse the principle that, where such asymmetric encodings exist in a notational context, variation sequences are to be defined to allow fully specified encoding of semantic variants for notational purposes.