Re: Encoding Personal Use Ideographs (was Re: Level of Unicode support required for various languages)

Date: Fri Nov 02 2007 - 06:22:56 CST

    Quoting Andrew West:

    > On 01/11/2007, John H. Jenkins wrote:
    >> > If you were going to ask me what the "best" way to represent kanji
    >> > ligatures such as <U+2FF5 U+9580 U+9F8D> would be under an ideal
    >> > Unicode model, I would say as <U+9580 U+200D U+9F8D>, using ZWJ to
    >> > indicate the ligation, and smart fonts would ligate the two components
    >> > into a single glyph if they could.
    >> Actually, do it without the ZWJ, which would break the IDS syntax.
    >> Just make the ligature on by default.
    > To clarify, in my ideal world IDS sequences would not be composable
    > into a single glyph by fonts, but would always be rendered as a
    > sequence of IDC and ideographic characters. I would use ZWJ for
    > hanzi/kanji ligation without any IDC characters. The obvious
    > disadvantage to this is that it does not give the font any clues as to
    > what the character should look like, but that is true for all scripts
    > that have ligatures. In the case of simple kanji ligatures the
    > resultant glyph is usually self-evident, but in any case font
    > designers would probably have to know which particular kanji ligatures
    > they wanted to support in the first place.
    > The beauty of the ZWJ model (or evilness of the model, depending on
    > your point of view) is that an A-ZWJ-B ligature may look exactly the
    > same as a B-ZWJ-A ligature but would be treated as distinct entities.
    > Thus, if someone wanted to create a ligature of U+9F8D 龍 long2
    > "dragon" and U+9580 門 men2 "gate" as a cute way of writing Longmen 龍門
    > "Dragon's Gate", with U+9F8D inside U+9580 they could do so with the
    > sequence <U+9F8D U+200D U+9580> (representing the logical order of
    > the ligatured characters). This would render the same as Ben's
    > <U+9580 U+200D U+9F8D>, but would be treated differently by search
    > engines, etc.

    Yes, though the question is of course what counts as obvious; cf.

    U+9584 閄
    U-00021B89 &#138121;
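
    To make the quoted point about ordering concrete, here is a minimal
    sketch in Python. It only shows that the two ZWJ sequences are distinct
    strings to software, whatever a font draws for them. The helper names
    (code_points, same_after_nfc) are my own, and whether any font actually
    ligates these sequences is an assumption outside the code.

```python
import unicodedata

ZWJ = "\u200d"     # ZERO WIDTH JOINER
GATE = "\u9580"    # U+9580, "gate"
DRAGON = "\u9f8d"  # U+9F8D, "dragon"

# The two hypothetical ligature sequences discussed in the thread.
gate_dragon = GATE + ZWJ + DRAGON   # <U+9580 U+200D U+9F8D>
dragon_gate = DRAGON + ZWJ + GATE   # <U+9F8D U+200D U+9580>


def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def same_after_nfc(a, b):
    """True if the two strings are equal after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(code_points(gate_dragon))
print(code_points(dragon_gate))
print(same_after_nfc(gate_dragon, dragon_gate))
```

    Normalization leaves both sequences unchanged, so a plain string
    search for one will not find the other, even if the rendered glyphs
    were identical.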

    > Incidentally, if Ben does want to find evidence for <U+2FF5 U+9580
    > U+9F8D> that will satisfy UTC and WG2 then my suggestion is that he
    > trawls through the corpus of literature relating to the Longmen
    > Grottoes and ancient
    > descriptions of walled cities with gates named Longmen -- I'm sure
    > that someone sometime somewhere must have already created the
    > character as a shorthand for <U+9F8D U+9580>. The thing that really
    > surprises me is that it is not already encoded, when we have
    > characters such as:

    There are literally thousands, even tens of thousands, of very simple
    characters that are not yet encoded. The simplest ones I can think of
    have only four strokes. The best known is the rectangle with a
    vertical line; my favourite consists of U+5B50 子 and U+529B 力,
    Zhuang lwg, meaning "child".

    Can anyone think of a three-stroke character that is on the list of
    characters to be encoded?

    The point of the above is that if even fairly common four-stroke
    characters are yet to be encoded, it should be no surprise that
    <U+9F8D U+9580> has not been either.
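
    On the IDS side of the thread (the remark that a ZWJ would break the
    IDS syntax), the well-formedness check is easy to sketch. This is a
    simplified illustration, not a full implementation: it assumes only
    the IDCs U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three operands
    and the rest two, and it accepts as an operand any character whose
    Unicode name starts with "CJK". The function names are my own.

```python
import unicodedata

# Assumption: only the IDCs U+2FF0..U+2FFB are handled here.
TERNARY_IDCS = {"\u2ff2", "\u2ff3"}


def idc_arity(ch):
    """Return 3 or 2 for an IDC in U+2FF0..U+2FFB, else 0."""
    if ch in TERNARY_IDCS:
        return 3
    if "\u2ff0" <= ch <= "\u2ffb":
        return 2
    return 0


def parse_ids(text, pos=0):
    """Parse one IDS starting at pos; return the index just past it."""
    if pos >= len(text):
        raise ValueError("sequence ended before all operands were found")
    ch = text[pos]
    arity = idc_arity(ch)
    if arity == 0:
        # Simplified operand test: ideographs, CJK radicals and strokes.
        if not unicodedata.name(ch, "").startswith("CJK"):
            raise ValueError(f"U+{ord(ch):04X} is not a valid operand")
        return pos + 1
    pos += 1
    for _ in range(arity):
        pos = parse_ids(text, pos)
    return pos


def is_well_formed_ids(text):
    """True if the whole string is exactly one IDS under these rules."""
    try:
        return parse_ids(text) == len(text)
    except ValueError:
        return False


print(is_well_formed_ids("\u2ff5\u9580\u9f8d"))        # <U+2FF5 U+9580 U+9F8D>
print(is_well_formed_ids("\u2ff5\u9580\u200d\u9f8d"))  # same, with ZWJ inserted
```

    Under these assumptions the first sequence parses as one IDS and the
    second does not, because U+200D is neither an IDC nor an operand.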

