Re: Encoding Personal Use Ideographs (was Re: Level of Unicode support required for various languages)

From: Andrew West
Date: Fri Nov 02 2007 - 04:51:46 CST

    On 01/11/2007, John H. Jenkins wrote:
    > > If you were going to ask me what the "best" way to represent kanji
    > > ligatures such as <U+2FF5 U+9580 U+9F8D> would be under an ideal
    > > Unicode model, I would say as <U+9580 U+200D U+9F8D>, using ZWJ to
    > > indicate the ligation, and smart fonts would ligate the two components
    > > into a single glyph if they could.
    > Actually, do it without the ZWJ, which would break the IDS syntax.
    > Just make the ligature on by default.

    To clarify, in my ideal world IDS sequences would not be composable
    into a single glyph by fonts, but would always be rendered as a
    sequence of IDC and ideographic characters. I would use ZWJ for
    hanzi/kanji ligation without any IDC characters. The obvious
    disadvantage of this is that it does not give the font any clues as to
    what the character should look like, but that is true for all scripts
    that have ligatures. In the case of simple kanji ligatures the
    resultant glyph is usually self-evident, but in any case font
    designers would probably have to know which particular kanji ligatures
    they wanted to support in the first place.
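
    The two representations discussed above can be told apart
    mechanically. The following is a minimal sketch, assuming Python 3
    and its standard unicodedata module; the helper names describe and
    is_idc are hypothetical and chosen only for illustration. It shows
    that the IDS form starts with an Ideographic Description Character,
    while the ZWJ form contains only the two ideographs and the joiner.

```python
import unicodedata

# The two candidate representations from the discussion.
IDS_FORM = "\u2FF5\u9580\u9F8D"   # <U+2FF5 U+9580 U+9F8D>
ZWJ_FORM = "\u9580\u200D\u9F8D"   # <U+9580 U+200D U+9F8D>


def describe(seq):
    """List each code point with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in seq]


def is_idc(ch):
    """True if ch is an Ideographic Description Character (by name)."""
    return unicodedata.name(ch, "").startswith(
        "IDEOGRAPHIC DESCRIPTION CHARACTER"
    )


for label, seq in (("IDS", IDS_FORM), ("ZWJ", ZWJ_FORM)):
    print(label, describe(seq), "contains IDC:", any(is_idc(c) for c in seq))
```

    Nothing here says how a font would actually render either
    sequence; it only shows what information each sequence carries.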

    The beauty of the ZWJ model (or evilness of the model, depending on
    your point of view) is that an A-ZWJ-B ligature may look exactly the
    same as a B-ZWJ-A ligature but would be treated as distinct entities.
    Thus, if someone wanted to create a ligature of U+9F8D 龍 long2
    "dragon" and U+9580 門 men2 "gate" as a cute way of writing Longmen 龍門
    "Dragon's Gate", with U+9F8D inside U+9580, they could do so with the
    sequence <U+9F8D U+200D U+9580> (representing the logical order of
    the ligatured characters). This would render the same as Ben's
    <U+9580 U+200D U+9F8D>, but would be treated differently by search
    engines, etc.
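
    To make the point about distinct treatment concrete, here is a
    small sketch in Python 3 (standard library only). It compares the
    two sequences as plain strings and after NFC normalization. The
    last step, stripping ZWJ before matching, is only an assumption
    about what a search tool might do, not a description of any real
    search engine; the names gate_dragon, dragon_gate and code_points
    are hypothetical.

```python
import unicodedata

ZWJ = "\u200D"
MEN = "\u9580"    # 門 "gate"
LONG = "\u9F8D"   # 龍 "dragon"

gate_dragon = MEN + ZWJ + LONG    # <U+9580 U+200D U+9F8D>
dragon_gate = LONG + ZWJ + MEN    # <U+9F8D U+200D U+9580>


def code_points(s):
    """Format a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


print(code_points(gate_dragon))   # U+9580 U+200D U+9F8D
print(code_points(dragon_gate))   # U+9F8D U+200D U+9580

# The sequences differ as strings, and NFC normalization does not
# reorder or merge them, so they stay distinct.
print(gate_dragon == dragon_gate)                      # False
print(unicodedata.normalize("NFC", gate_dragon)
      == unicodedata.normalize("NFC", dragon_gate))    # False

# Assumed behaviour: a tool that ignores ZWJ when matching would find
# the word 龍門 in the logical-order sequence only.
print(dragon_gate.replace(ZWJ, "") == LONG + MEN)      # True
print(gate_dragon.replace(ZWJ, "") == LONG + MEN)      # False
```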

    Incidentally, if Ben does want to find evidence for <U+2FF5 U+9580
    U+9F8D> that will satisfy UTC and WG2 then my suggestion is that he
    trawls through the corpus of literature relating to the Longmen
    Grottoes and ancient
    descriptions of walled cities with gates named Longmen -- I'm sure
    that someone sometime somewhere must have already created the
    character as a shorthand for <U+9F8D U+9580>. The thing that really
    surprises me is that it is not already encoded, when we have
    characters such as:

    U+49B0 䦰 gate + tortoise/turtle
    U+95A9 閩 gate + insect
    U+95D6 闖 gate + horse
    U+28CEF 𨳯 gate + ox
    U+28D2F 𨴯 gate + pig
    U+28D58 𨵘 gate + tiger
    U+28D5C 𨵜 gate + frog
    U+28D85 𨶅 gate + lamb
    U+28D87 𨶇 gate + crow
    U+28DA0 𨶠 gate + bird
    U+28DA2 𨶢 gate + fish
    U+28DCD 𨷍 gate + tortoise/turtle
    U+28DDF 𨷟 gate + tortoise/turtle
    U+28DF7 𨷷 gate + insect
    U+28DFA 𨷺 gate + tortoise/turtle

