Re: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)

From: John Jenkins
Date: Wed Jan 21 2004 - 13:13:33 EST

    On Jan 21, 2004, at 6:36 AM, Andrew C. West wrote:

    > If a simplified form of a given CJK ideograph is used, then it deserves
    > encoding properly. There are newly-coined simplified forms in CJK-B and
    > CJK-C, so why not add newly used simplified forms to CJK-C or wherever
    > if they are really needed? To borrow Michael's term, this use of
    > variation selectors is simply pseudo-coding.

    Well, first of all, there were a *lot* of mistakes made in Extension B.
    And Extension C isn't encoded yet. The UTC intends to lobby WG2 to do
    the encoding of such forms via variation selectors.

    The whole point of using variation selectors is that the line between
    character and glyph can sometimes be a fuzzy one, and Han is probably
    the worst case. In the case of traditional and simplified Chinese (TC
    and SC), it's just as easy (in many cases, where there's a one-to-one,
    algorithmic relationship) to see the two forms as glyphic avatars of a
    single, Platonic character. Such a
    representation, via variation selectors, aids a number of processes,
    such as fuzzy searching, text-to-speech, and so on, because you don't
    require new tables to do a match.
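
    As a rough illustration of the "no new tables" point, here is a
    minimal Python sketch of a fuzzy match that simply ignores variation
    selectors. The helper names are made up for this example, and the
    sequence U+9F9C + U+E0100 is used purely as a placeholder: whether
    that particular sequence is registered, and what glyph it would
    select, is an assumption. Only the two general-purpose variation
    selector ranges are handled; the Mongolian free variation selectors
    are left out.

```python
# Sketch only: helper names are hypothetical, not from any library.

def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Drop every variation selector, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def fuzzy_contains(haystack: str, needle: str) -> bool:
    """Substring test that treats a variation sequence as its base character."""
    return strip_variation_selectors(needle) in strip_variation_selectors(haystack)


# U+9F9C is the turtle ideograph; pairing it with U+E0100 (VS17) is an
# assumption made for illustration, not a claim about a registered sequence.
plain_turtle = "\u9f9c"
variant_turtle = "\u9f9c\U000E0100"

text = "abc" + variant_turtle + "def"
found = fuzzy_contains(text, plain_turtle)
```

    A search for the plain character finds the variant form without any
    mapping table, because the base code point is shared. Had the variant
    been encoded as a separate character, the same search would need an
    equivalence table to connect the two.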

    Indeed, right now I have to periodically run checks on the Unihan
    database to make sure that TC/SC pairs have the same readings. It's a

    From an end-user perspective, there is *NO DIFFERENCE* between
    representing these characters using variation selectors and direct
    encoding. They can show up in input methods and fonts just the same.

    > 1. Unicode Design Principle 3: "The Unicode Standard encodes characters,
    > not glyphs."
    > This is simple glyph variant. I insist on writing the "A" in my name with
    > two cross-bars. Will the UTC kindly accommodate me by providing an
    > appropriate standardised variant for U+0041? (In fact, come to think of
    > it I have idiosyncratic ways of writing all of the letters in my name ...)

    Well, a personal name ideograph is perhaps not the best example, since
    the size of the "personal name" problem is unknown. IIRC nobody's won
    Rick's contest yet. The goal was to come up with an instance where
    some people make a distinction and others don't. In any event, the
    example is not entirely tongue-in-cheek. First of all, all three of my
    Cantonese-English dictionaries contain a variant turtle ideograph which
    isn't encoded yet. (I haven't looked in Extension C, BTW.) Secondly,
    the original Korean proposal for Extension C contained literally dozens
    of variant turtle ideographs.

    The difficulty here -- and this leads into the third example -- is that
    the Koreans derived their characters from a soft copy of the Korean
    tripitaka. Now, I would assert that these variant turtles are probably
    just variant turtles, chosen idiosyncratically by the scribe for
    whatever reason. (Rather the way that 16th and 17th century English
    books have fairly random and inconsistent spelling.) If it is
    absolutely necessary to embody this variation, it would be better to
    use rich text. Unfortunately, it's impossible to know for certain
    whether this is the case or not, and so variation selectors are
    available to make a distinction possible in plain text for those who
    care about it.

    Granted, epigraphy is tough on plain text. As Unicode starts to deal
    with dead scripts, we have to deal with the issues it raises.
    Variation selectors are one way of doing it.

    > The plain fact of the matter is that the *character* turtle is already
    > encoded, and if someone wants to use a different glyph form for this
    > character then he or she should design their own font with the
    > appropriate glyph mapped to U+9F9C or U+9F9F.

    Or any of the other turtles we already have.

    John H. Jenkins
