RE: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jan 20 2004 - 14:27:09 EST

    John Jenkins tried to present some usage cases for Han FVS
    combinations, and Mike Ayers responded with a bunch more questions:

    > Ummm - if this simplified form were used at all, wouldn't it already
    > be encoded? Isn't there a process for getting such encoded? Has this
    > process broken down, or have some of its assumptions been shown invalid?

    If that simplified form were used at all, it would be *in use*, not
    necessarily encoded. Not all Chinese printed material has gone
    through a computer encoding to be set in type, and even material
    that is represented via computerized typesetting may have been
    set in fonts that apply regular simplification rules to some glyphs
    that may not actually occur in the GB standards for these things.

    > Huh? You forgot the part about "the font designer psychically
    > already knew how Mr. Turtle draws his name and encoded the glyph for it,

    The fact is that thousands of such oddball variants already *do* exist
    in print, which means that some "font designer" someplace already did
    so. Well, the instance in "print" may actually be a handwritten or
    carved form. They are less likely to occur at random in modern computer
    fonts, but even there, more or less random collections of "gaiji" get
    added to the fonts and then may be used in one context or another.

    > ... Are you saying
    > that there is a known limit to the number of character variants, and that
    > there is an establishable correspondence between these variants such that a
    > logical connection between a variant and one of a set of FVS is possible?
    > Call me skeptical...

    The real problem that the committee is dealing with is that there are
    a number of significant collections of such kinds of variants,
    particularly in Japan. And ways need to be found to interoperate with
    software that implements such lists, lest de facto alternate
    encodings spring up that would undermine the case for universal usage of
    Unicode in East Asia.

    To date, extensions to Unicode that cover variants of already-encoded
    characters have simply ended up adding more variants as "unified" Han
    characters. But carried too far, that dilutes the identity of the
    core character itself.

    The alternative being investigated is to consider such things as
    turtle-variant-17 to simply be representable by a sequence such
    as <2A6C9, E0180>, rather than having to add yet *another* variant
    turtle character on its own.
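
    As a rough illustration only (a sketch, not anything the committee
    has specified), such a sequence is just a base character followed
    by a variation selector. The helper names below are hypothetical;
    the two ranges are the Variation Selectors block and the Variation
    Selectors Supplement. The Mongolian free variation selectors are
    left out of this sketch.

```python
# Code points from the example sequence <2A6C9, E0180>.
BASE = 0x2A6C9  # lies in the CJK Unified Ideographs Extension B range
VS = 0xE0180    # lies in the Variation Selectors Supplement


def is_variation_selector(cp: int) -> bool:
    """Return True if cp is in U+FE00..U+FE0F or U+E0100..U+E01EF."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def selector_number(cp: int) -> int:
    """Map a variation selector code point to its number (VS1..VS256)."""
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00 + 1
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 17
    raise ValueError(f"U+{cp:04X} is not a variation selector")


# The variant is plain text: two code points, base first, selector second.
sequence = chr(BASE) + chr(VS)
```

    A process that does not know the variant can ignore the selector
    and still see the base character, which is the point of not
    encoding yet another separate turtle.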

    > Whoa, Nellie!
    >
    > Did "represent newly discovered characters" creep into the mission
    > statement of plain text when I wasn't looking?

    This has *always* been part of the agenda of the encoding committees.

    If you are representing Han data as Unicode plain text, and you
    run into a "newly discovered character", you are stuck. Your options
    are:

      1. Use a "geta" (U+3013), i.e. throw up your hands and punt.
      
      2. Use an Ideographic Description Sequence to get an approximate
         description as a substitute.
         
      3. Ask the character encoding committees to encode the character
         (a process that will take a long while).
         
      4. Ask the character encoding committees to make the character
         representable by a designated variation sequence (a process
         that may also take a long while, but which could short-circuit
         things considerably if the known lists of these things were
         all processed ahead of time).
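
    Options 1 and 2 are the only ones available without waiting on a
    committee. A minimal sketch of that fallback, assuming the
    Ideographic Description Characters at U+2FF0..U+2FFB (newer
    versions of the standard may define more). The function names and
    the example description are hypothetical, and the check is
    deliberately shallow: it does not parse the full IDS syntax.

```python
from typing import Optional

GETA = "\u3013"  # GETA MARK, the placeholder named in option 1

# Ideographic Description Characters assumed for this sketch.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB


def looks_like_ids(text: str) -> bool:
    """Shallow check: an IDS starts with an Ideographic Description Character."""
    return bool(text) and IDC_FIRST <= ord(text[0]) <= IDC_LAST


def placeholder_for(ids: Optional[str]) -> str:
    """Prefer an approximate IDS (option 2); otherwise use a geta (option 1)."""
    if ids and looks_like_ids(ids):
        return ids
    return GETA


# Hypothetical description: left-to-right composition (U+2FF0) of the
# water component U+6C35 and the turtle character U+9F9C.
example_ids = "\u2FF0\u6C35\u9F9C"
```

    Calling placeholder_for(example_ids) hands back the description
    unchanged, while placeholder_for(None) yields the geta.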
         
    --Ken


