Re: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Wed Jan 21 2004 - 08:36:54 EST


    On Tue, 20 Jan 2004 10:32:06 -0700, John Jenkins wrote:
    >
    > 1) U+9CE6 is a traditional Chinese character (a kind of swallow)
    > without a SC counterpart encoded. However, applying the usual rules
    > for simplifications, it would be easy to derive a simplified form which
    > one could conceivably see in a book printed in the PRC. Rather than
    > encode the simplified form, the UTC would prefer to represent the SC
    > form using U+9CE6 + a variation selector.
    >

    If a simplified form of a given CJK ideograph is actually used, then it deserves
    to be encoded properly. There are newly coined simplified forms in CJK-B and
    CJK-C, so why not add newly used simplified forms to CJK-C or wherever, if they
    are really needed? To borrow Michael's term, this use of variation selectors is
    simply pseudo-coding.

    If a Chinese publishing house were going to print a book in simplified
    characters that included a simplified form of U+9CE6, would they go to the
    lengths of applying to Unicode to define an appropriate standardised variant for
    U+9CE6, and then trying to create a font that implemented variation selectors?
    Or would they simply use a font that mapped a simplified glyph form to U+9CE6
    (or the PUA)? If it is so important to formally define the existence of a
    simplified form of an existing character, then why not encode it properly?
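
    For concreteness, the representation being proposed is just two code points
    in plain text: the base ideograph followed by a variation selector. A minimal
    Python sketch, assuming purely for illustration that the selector is U+E0100
    (VARIATION SELECTOR-17); no such sequence for U+9CE6 is actually defined, and
    the helper names are hypothetical:

```python
# Illustrative only: which selector (if any) would be assigned to U+9CE6 is an
# assumption; U+E0100 is picked arbitrarily.
BASE = "\u9ce6"
SELECTOR = "\U000E0100"  # VARIATION SELECTOR-17


def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Drop variation selectors, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


sequence = BASE + SELECTOR
print([f"U+{ord(ch):04X}" for ch in sequence])      # ['U+9CE6', 'U+E0100']
print(strip_variation_selectors(sequence) == BASE)  # True
```

    Software that ignores the selector sees only U+9CE6, which is exactly why
    the sequence selects a glyph rather than identifying a distinct character.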

    > 2) Your best friend has the last name of "turtle," but he doesn't use
    > any of the encoded forms for the turtle character to represent it. He
    > insists on writing it in yet another way and wants to be able to
    > include his name as he writes it in the source code he edits. The UTC
    > ends up accommodating him using U+2A6C9 (which is the closest turtle to
    > his last name) + a variation selector.

    1. Unicode Design Principle 3 : "The Unicode Standard encodes characters, not
    glyphs."
    This is a simple glyph variant. I insist on writing the "A" in my name with two
    cross-bars. Will the UTC kindly accommodate me by providing an appropriate
    standardised variant for U+0041 ? (In fact, come to think of it I have
    idiosyncratic ways of writing all of the letters in my name ...)

    The plain fact of the matter is that the *character* turtle is already encoded,
    and if someone wants to use a different glyph form for this character then he or
    she should design their own font with the appropriate glyph mapped to U+9F9C or
    U+9F9F.

    2. Unicode does not encode private-use characters.
    I can't find chapter and verse for it, but I was always under the impression
    that this was the case.

    > 3) You're editing a critical edition of an ancient MS, and you find
    > that your author, who talks a lot about handkerchiefs, uses U+5E28
    > quite a bit, but varies between the "ears-in" form and the "ears-out"
    > form almost at random. Rather than lose the distinction which *may* be
    > meaningful, you (with the UTC's blessing) use U+5E28 for the ears-in
    > form (as Unicode uses) and U+5E28 + a variation selector for the
    > ears-out form.

    This example actually opens up the biggest can of worms.

    As someone who has a passion for transcribing ancient manuscripts, in Chinese
    and other scripts, I fully appreciate the desire to be able to represent every
    little idiosyncrasy of a manuscript or inscription in plain text Unicode. But
    the simple fact of the matter is that you can't. My apologies for repeating
    myself, but Unicode Design Principle 3 states that "The Unicode Standard encodes
    characters, not glyphs." (and Section 2.2 of TUS elaborates on this statement).

    Unless Unicode is to become a Glyph Encoding Standard instead of a Character
    Encoding Standard, how on earth can the UTC allow VSs to be used for simple
    glyph variants? And if it's OK for CJK ideographs, then why not for every other
    Unicoded script?

    Glyph variations are of paramount interest to textual scholars and epigraphers
    of all scripts, not just Chinese. To take a random example from the Celtic
    Inscribed Stones Project (CISP), this is a palaeographic description of a cross
    slab at Kirk Maughold in the Isle of Man, inscribed [--]I IN CHRISTI NOMINE
    CRUCIS CHRISTI IMAGENEM :

    Kermode/1907, 112: `we have here the diamond-shaped O, the N like an H, and the
    M like a double H, all characteristics of the Hiberno-Saxon manuscripts and
    sculptured stones of the period. Other characteristic forms are the
    square-shaped C and the peculiar G, the like of which I have not seen elsewhere.
    But some of the letters are minuscules, as p, d, b, r, and a; while in the
    contraction for CHRISTI, in each case the R differs from the ordinary small R in
    CRUCIS, representing, in fact, the Greek Rho!'.

    [http://www.ucl.ac.uk/archaeology/cisp/database/stone/maugh_4.html]

    If we go down the road of encoding epigraphic and palaeographic glyph variants
    for CJK and other scripts, I'm afraid that we'll soon find that 256 Variation
    Selectors just aren't enough.
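
    The figure of 256 is the sum of the two encoded blocks of variation
    selectors: U+FE00..U+FE0F and U+E0100..U+E01EF. A quick Python check of that
    arithmetic (the constant name is only illustrative):

```python
# Inclusive code point ranges of the two variation selector blocks.
VS_RANGES = [
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1 .. VARIATION SELECTOR-16
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17 .. VARIATION SELECTOR-256
]

# Each range is inclusive, hence the +1.
counts = [last - first + 1 for first, last in VS_RANGES]
total = sum(counts)

print(counts)  # [16, 240]
print(total)   # 256
```

    Since a variation sequence is a base character plus one selector, 256 is
    also the ceiling on distinguishable variants of any single base character.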

    Andrew



    This archive was generated by hypermail 2.1.5 : Wed Jan 21 2004 - 10:11:30 EST