From: Kenneth Whistler (email@example.com)
Date: Mon Mar 29 2004 - 14:28:23 EST
Ernest Cline stated:
> The standard is quite clear that if a Variation Selector is recognized, but
> the sequence it is in is not, then it should be treated the same as if no
> selector was present.
Which is true.
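
As a rough illustration of that rule, here is a minimal Python sketch. The registry KNOWN_SEQUENCES and the function names are hypothetical, and the single entry in the registry is only a placeholder; a real implementation would presumably load the recognized sequences from the standard's own data.

```python
# Hypothetical registry of recognized variation sequences, as
# (base character, variation selector) pairs. The entry is illustrative only.
KNOWN_SEQUENCES = {("\u2229", "\uFE00")}


def is_variation_selector(ch):
    """True for the BMP selectors (FE00..FE0F) and the Plane 14 ones (E0100..E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def resolve(text):
    """Split text into (base, selector) pairs.

    A selector is kept only when it forms a recognized sequence with the
    character immediately before it. Otherwise it is dropped, so the base
    character is treated exactly as if no selector had been present.
    """
    out = []
    for ch in text:
        if is_variation_selector(ch):
            if out and out[-1][1] is None and (out[-1][0], ch) in KNOWN_SEQUENCES:
                out[-1] = (out[-1][0], ch)
            continue  # unrecognized sequence: ignore the selector
        out.append((ch, None))
    return out
```

With this sketch, resolve("A\uFE00") yields the same result as resolve("A"), since the pair is not in the registry.
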
> This is one reason why transferring some or all of the Variation Selectors
> on the SSP to Private Use is a possibility if they are not going to have
> any official uses.
This, however, is distinctly inadvisable, for several reasons.
First, the 240 Variation Selector characters on Plane 14 were added
*explicitly* to deal with Han variation issues, which involve
many, many more possible variants, in some cases, than the
typical numerosity for the occasional variants noted in other
scripts.
Second, the UTC is considering a scheme for dealing with existing
large collections of Han variants by explicitly dedicating 128
of those 240 to a preexisting glyph variant registration scheme,
to move the Han variation problem off dead center (given that the
task of spelling out exactly what *are* the variants is an enormous
problem for Han).
Third, the proposal to "transfer ... some or all of the Variation
Selectors on the SSP to Private Use" is unclear on the concept of
Private Use. The UTC will make *no* semantic encoding commitment
regarding what a private use character is to be used for. That would
include *not* specifying that some range of Private Use characters
be dedicated to use as variation selectors (privately defined).
Anyone who wanted to put in place their own private Idaho of
two-character encoding for Mende or whatever, could simply define
that private use space as they wish. Of course they cannot then
expect automatic rendering (or other) support from standard OS
interfaces, but that is the fundamental nature of Private Use.
Essentially what you seem to be asking for is for the UTC to
relax the restriction of definition of *variation sequences* --
i.e. let some of the variation selectors be used on an ad hoc
basis by consenting adults. But that was *explicitly* ruled out
by the UTC as a potential barrier to interoperability and because
it would be an invitation to chaotic glyph encoding.
> Any Unicode 4.0 compliant software would
> degrade the presentation of such data gracefully.
> The only reason I can see for having 256 Variation Selectors is to
> enable round trip encoding of data using legacy 8 bit character sets
> that has data which is either invalid or unknown in Unicode.
Nope. They were introduced for Han.
> ... I find it
> doubtful that any non-algorithmic uses of Variation Selectors will
> require even as many as 16 such selectors for official sequences.
Some Han sources have lists exceeding 100 variants for a single
Han "character". Whether the UTC would consider all of those as
variants of the *same* unified Han character is an open question,
but the numerosity of such collections is not.
> Therefore I would expect that by default
> all VS characters are ignored in a full-blown collation implementation,
> open the choice of supporting, say, a fourth level difference between specific
> known variation sequences.
From allkeys.txt, the default data file for the Unicode Collation
Algorithm:
FE00 ; [.0000.0000.0000.0000] # [FE00] VARIATION SELECTOR-1
E0100 ; [.0000.0000.0000.0000] # [E0100] VARIATION SELECTOR-17
All those zeroes have precisely the effect that Asmus has indicated.
The variation selectors are ignored completely by the default
tables for collation.
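
To see what "all those zeroes" means mechanically, here is a small Python sketch that parses the two entries quoted above and checks that every weight is zero. The names parse_line and is_completely_ignorable are made up for this example, and the parser only handles the simple line shape shown here, not the full allkeys.txt syntax.

```python
import re

# The two entries quoted above, copied verbatim.
SAMPLE = """\
FE00 ; [.0000.0000.0000.0000] # [FE00] VARIATION SELECTOR-1
E0100 ; [.0000.0000.0000.0000] # [E0100] VARIATION SELECTOR-17
"""

# One collation element: '[', a '.' or '*' marker, dot-separated hex weights, ']'.
ELEMENT = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


def parse_line(line):
    """Return (code points, list of weight tuples) for one table entry."""
    body = line.split("#", 1)[0]              # drop the trailing comment
    code_field, _, elements = body.partition(";")
    code_points = tuple(int(cp, 16) for cp in code_field.split())
    weights = [
        tuple(int(w, 16) for w in m.group(2).split("."))
        for m in ELEMENT.finditer(elements)
    ]
    return code_points, weights


def is_completely_ignorable(weights):
    """True when every weight at every level is zero."""
    return all(w == 0 for element in weights for w in element)


table = dict(parse_line(line) for line in SAMPLE.splitlines())
```

An entry whose weights are all zero contributes nothing to the sort key at any level, so two strings that differ only by such a character compare as equal under the default table.
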
Peter Kirk said:
> Surely Variation Selectors are "default ignorable" characters, which
> implies that if a process (including collation?) doesn't know what to do
> with them they should be ignored, i.e. treated as not present rather
> than as undefined characters.
From DerivedCoreProperties.txt in the Unicode Character Database:
FE00..FE0F ; Default_Ignorable_Code_Point # Mn  VARIATION
E0100..E01EF ; Default_Ignorable_Code_Point # Mn  VARIATION
Please read the standard carefully regarding what "default ignorable"
means. TUS 4.0, p. 142:
"Default ignorable code points are those that should be ignored by
default in rendering unless explicitly supported. ..."
Some, like U+00AD SOFT HYPHEN, don't necessarily get the zeroes
treatment in the default collation table. Some, like U+034F COMBINING
GRAPHEME JOINER, while getting zero weights in the default table,
were added explicitly in order to make a potential distinction for
collation.
The *essential* concept of default ignorable characters is that
they consist of the class of characters which, if you don't know
what their impact on visual rendering is, you are better off
displaying *nothing* for them, rather than displaying the black
box (or other blort) indicating the presence of a nondisplayable
character.
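
A minimal Python sketch of that display rule follows. It covers only the two variation selector ranges quoted above from DerivedCoreProperties.txt (the full property has many more code points), and has_glyph is a hypothetical callback standing in for whatever the rendering system can actually display or explicitly supports.

```python
# Only the variation selector ranges quoted above; the complete
# Default_Ignorable_Code_Point list is in DerivedCoreProperties.txt.
DEFAULT_IGNORABLE_RANGES = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def is_default_ignorable(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in DEFAULT_IGNORABLE_RANGES)


def display_form(text, has_glyph, missing_glyph="\u25A1"):
    """Return what a simple renderer would show for text.

    has_glyph(ch) is a hypothetical predicate: True when the renderer
    can display ch or explicitly supports it.
    """
    out = []
    for ch in text:
        if has_glyph(ch):
            out.append(ch)             # displayable or explicitly supported
        elif is_default_ignorable(ch):
            continue                   # show nothing at all
        else:
            out.append(missing_glyph)  # the "black box" fallback
    return "".join(out)
```

For example, with a renderer that only knows ASCII, an unsupported variation selector disappears silently, while an unsupported Han character still gets the missing-glyph box.
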