Re: What is the principle?

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Mar 29 2004 - 14:28:23 EST

    Ernest Cline stated:

    > The standard is quite clear that if a Variation Selector is recognized, but
    > not the sequence it is, then it should be treated the same as if no selector
    > was present.

    Which is true.
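
    A minimal sketch of that fallback, in Python. The function names and
    the one-entry table are hypothetical; a real process would load its
    recognized sequences from StandardizedVariants.txt rather than
    hard-code them.

```python
# Hypothetical table of the variation sequences this process recognizes.
# The single entry (U+2268 followed by VARIATION SELECTOR-1) is only an
# illustration; a real table would come from StandardizedVariants.txt.
KNOWN_SEQUENCES = {("\u2268", "\uFE00")}


def is_variation_selector(ch):
    """True for VARIATION SELECTOR-1..16 and VARIATION SELECTOR-17..256."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def drop_unrecognized_selectors(text, known=KNOWN_SEQUENCES):
    """Keep a selector only when (base, selector) is a recognized sequence.

    Any other selector is dropped, so the base character is handled
    exactly as if no selector had been present.
    """
    out = []
    for ch in text:
        if is_variation_selector(ch):
            if out and (out[-1], ch) in known:
                out.append(ch)
            continue
        out.append(ch)
    return "".join(out)
```

    With this table, "A" followed by U+FE00 comes back as plain "A",
    while U+2268 followed by U+FE00 is passed through unchanged.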

    >
    > This is one reason why transferring some or all of the Variation Selectors
    > on the SSP to Private Use is a possibility if they are not going to have
    > any official uses.

    This, however, is distinctly inadvisable, for several reasons.

    First, the 240 Variation Selector characters on Plane 14 were added
    *explicitly* to deal with Han variation issues, which involve
    many, many more possible variants, in some cases, than the
    typical numerosity for the occasional variants noted in other
    scripts.

    Second, the UTC is considering a scheme for dealing with existing
    large collections of Han variants by explicitly dedicating 128
    of those 240 to a preexisting glyph variant registration scheme,
    to move the Han variation problem off dead center (given that the
    task of spelling out exactly what *are* the variants is an enormous
    problem for Han).

    Third, the proposal to "transfer ... some or all of the Variation
    Selectors on the SSP to Private Use" is unclear on the concept of
    Private Use. The UTC will make *no* semantic encoding commitment
    regarding what a private use character is to be used for. That would
    include *not* specifying that some range of Private Use characters
    be dedicated to use as variation selectors (privately defined).
    Anyone who wanted to put in place their own private Idaho of
    two-character encoding for Mende or whatever, could simply define
    that private use space as they wish. Of course they cannot then
    expect automatic rendering (or other) support from standard OS
    interfaces, but that is the fundamental nature of Private Use
    characters.

    Essentially what you seem to be asking for is for the UTC to
    relax the restriction on the definition of *variation sequences* --
    i.e. let some of the variation selectors be used on an ad hoc
    basis by consenting adults. But that was *explicitly* ruled out
    by the UTC as a potential barrier to interoperability and because
    it would be an invitation to chaotic glyph encoding.

    > Any Unicode 4.0 compliant software would
    > degrade the presentation of such data gracefully.
    >
    > The only reason I can see for having 256 Variation Selectors is to
    > enable round trip encoding of data using legacy 8 bit character sets
    > that has data which is either invalid or unknown in Unicode.

    Nope. They were introduced for Han.

    > ... I find it
    > doubtful that any non-algorithmic uses of Variation Selectors will
    > require even as many as 16 such selectors for official sequences.

    Some Han sources have lists exceeding 100 variants for a single
    Han "character". Whether the UTC would consider all of those as
    variants of the *same* unified Han character is an open question,
    but the numerosity of such collections is not.

    Asmus said:

    > Therefore I would expect that by default
    > all VS characters are ignored in a full-blown collation implementation,
    > leaving open the choice of supporting, say, a fourth level difference
    > between specific known variation sequences.

    From allkeys.txt, the default data file for the Unicode Collation
    Algorithm:

    FE00 ; [.0000.0000.0000.0000] # [FE00] VARIATION SELECTOR-1

    E0100 ; [.0000.0000.0000.0000] # [E0100] VARIATION SELECTOR-17

    All those zeroes have precisely the effect that Asmus has indicated.
    The variation selectors are ignored completely by the default
    tables for collation.
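
    To make the effect of the zeroes concrete, here is a rough Python
    sketch. It parses the two lines quoted above and builds a much
    simplified sort key (non-zero weights, level by level). The weights
    for "a" and "b" are made up for the illustration, and `parse_line`
    and `sort_key` are hypothetical names, not part of any UCA
    implementation.

```python
import re

# The two allkeys.txt lines quoted above.
ALLKEYS_LINES = [
    "FE00 ; [.0000.0000.0000.0000] # [FE00] VARIATION SELECTOR-1",
    "E0100 ; [.0000.0000.0000.0000] # [E0100] VARIATION SELECTOR-17",
]


def parse_line(line):
    """Return (code point, weights) for an entry with one collation element."""
    code, rest = line.split(";", 1)
    element = re.search(r"\[[.*]([0-9A-F.]+)\]", rest).group(1)
    weights = tuple(int(w, 16) for w in element.split("."))
    return int(code, 16), weights


TABLE = dict(parse_line(line) for line in ALLKEYS_LINES)
# Made-up weights for two letters, only so there is something to compare.
TABLE[ord("a")] = (0x0100, 0x0020, 0x0002, 0x0061)
TABLE[ord("b")] = (0x0101, 0x0020, 0x0002, 0x0062)


def sort_key(text):
    """Simplified key: for each level, append only the non-zero weights."""
    elements = [TABLE[ord(ch)] for ch in text]
    key = []
    for level in range(4):
        key.extend(e[level] for e in elements if e[level] != 0)
        key.append(0)  # level separator
    return tuple(key)
```

    Because every weight of a variation selector is zero, it contributes
    nothing at any level: `sort_key("a\uFE00b")` and `sort_key("ab")`
    come out identical.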

    Peter Kirk said:

    > Surely Variation Selectors are "default ignorable" characters, which
    > implies that if a process (including collation?) doesn't know what to do
    > with them they should be ignored, i.e. treated as not present rather
    > than as undefined characters.

    From DerivedCoreProperties.txt in the Unicode Character Database:

    FE00..FE0F ; Default_Ignorable_Code_Point # Mn [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
    E0100..E01EF ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
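
    For anyone who wants to check this mechanically, a small Python
    sketch that reads range lines in this format and tests membership.
    The input is just the two entries quoted above, each on a single
    line; `parse_ranges` and `has_property` are hypothetical helper
    names, and the full file contains many more ranges than these.

```python
# The two DerivedCoreProperties.txt entries quoted above, one per line.
LINES = [
    "FE00..FE0F ; Default_Ignorable_Code_Point # Mn [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16",
    "E0100..E01EF ; Default_Ignorable_Code_Point # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256",
]


def parse_ranges(lines, prop):
    """Collect (first, last) code point ranges listed for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        if not data.strip():
            continue
        field, name = (part.strip() for part in data.split(";"))
        if name != prop:
            continue
        first, _, last = field.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def has_property(cp, ranges):
    """True if the code point falls inside any of the given ranges."""
    return any(first <= cp <= last for first, last in ranges)


VS_RANGES = parse_ranges(LINES, "Default_Ignorable_Code_Point")
```

    The two ranges cover 16 + 240 = 256 code points, which is the full
    set of variation selectors.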

    Please read the standard carefully regarding what "default ignorable"
    means. TUS 4.0, p. 142:

    "Default ignorable code points are those that should be ignored by
    default in rendering unless explicitly supported. ..."
                ^^^^^^^^^

    Some, like U+00AD SOFT HYPHEN, don't necessarily get the zeroes
    treatment in the default collation table. Some, like U+034F COMBINING
    GRAPHEME JOINER, while getting zero weights in the default table,
    were added explicitly in order to make a potential distinction for
    collation.

    The *essential* concept of default ignorable characters is that
    they consist of the class of characters which, if you don't know
    what their impact on visual rendering is, you are better off
    displaying *nothing* for them, rather than displaying the black
    box (or other blort) indicating the presence of a nondisplayable
    character.
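
    In code terms, the rule amounts to something like the following
    Python sketch. Everything here is an assumption made for
    illustration: `can_display` stands in for whatever the font or
    rendering engine actually supports, U+25A1 WHITE SQUARE stands in
    for the missing-glyph box, and `is_default_ignorable` lists only the
    characters mentioned in this message, not the full property.

```python
MISSING_GLYPH = "\u25A1"  # WHITE SQUARE, standing in for the "black box"


def is_default_ignorable(cp):
    """Partial check: only the characters discussed in this message."""
    return (
        cp in (0x00AD, 0x034F)        # SOFT HYPHEN, COMBINING GRAPHEME JOINER
        or 0xFE00 <= cp <= 0xFE0F     # VARIATION SELECTOR-1..16
        or 0xE0100 <= cp <= 0xE01EF   # VARIATION SELECTOR-17..256
    )


def visible_text(text, can_display):
    """Return what a renderer following the default rule would show.

    can_display(ch) is a caller-supplied predicate saying whether the
    renderer explicitly supports the character.
    """
    out = []
    for ch in text:
        if can_display(ch):
            out.append(ch)
        elif is_default_ignorable(ord(ch)):
            continue  # unsupported default ignorable: display nothing
        else:
            out.append(MISSING_GLYPH)  # any other unsupported character
    return "".join(out)
```

    With a renderer that only handles ASCII, "a" + U+FE00 + "b" displays
    as "ab", whereas "a" + U+4E00 displays as "a" followed by the box.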

    --Ken


