RE: Encoding Personal Use Ideographs (was Re: Level of Unicode support required for various languages)

From: Philippe Verdy (
Date: Fri Nov 02 2007 - 20:03:55 CST

    James Kass wrote:
    > Andrew West wrote,
    > > The beauty of the ZWJ model (or evilness of the model, depending on
    > > your point of view) is that an A-ZWJ-B ligature may look exactly the
    > > same as a B-ZWJ-A ligature but would be treated as distinct entities.
    > > Thus, if someone wanted to create a ligature of U+9F8D 龍 long2
    > > "dragon" U+9580 門 men2 "gate" as cute way of writing Longmen 龍門
    > > "Dragon's Gate", with U+9F8D inside U+9580 they could do so with the
    > > sequence <U+9F8D U+200D U+9580> (representing the logical order of
    > > the ligatured characters). This would render the same as Ben's
    > > <U+9580 U+200D U+9F8D>, but would be treated differently by search
    > > engines, etc.
    > Are you sure they would both render the same?

    Certainly, specifying a ligature with ZWJ will not be sufficient, as it does
    not indicate the type of "ligature" performed: enclosure of the second
    ideograph within the first one, or superposition of smaller sizes, or
    juxtaposition of narrowed ideographs within the same square.
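
    To make the point concrete, here is a minimal Python sketch (the variable
    and function names are my own, chosen only for illustration). It builds the
    two ZWJ sequences discussed above and checks that they are distinct code
    point sequences under every normalization form, while neither one carries
    any information about the intended layout.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
LONG = "\u9F8D"  # 龍 "dragon"
MEN = "\u9580"   # 門 "gate"

# The two sequences from the discussion (illustrative names).
seq_logical = LONG + ZWJ + MEN  # <U+9F8D U+200D U+9580>
seq_layout = MEN + ZWJ + LONG   # <U+9580 U+200D U+9F8D>


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


# Normalization does not reorder or merge the two sequences,
# so a search engine comparing code points sees two different strings.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")
still_distinct = all(
    unicodedata.normalize(form, seq_logical)
    != unicodedata.normalize(form, seq_layout)
    for form in FORMS
)

print(code_points(seq_logical))
print(code_points(seq_layout))
print("distinct under all normalization forms:", still_distinct)
```

    Nothing in either string says whether the result is an enclosure, a
    superposition or a juxtaposition; that choice is left entirely to the font.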

    ZWJ would be even less useful than using IDCs in an IDS, notably if you
    want it not to specify the relative order (something completely against
    the philosophy of Unicode, which wants a logical ordering based on
    semantics, or on the order of syllables in the composed square).
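
    By contrast, an IDS names the layout explicitly through its IDC. The
    following Python sketch is only an illustration: the mapping of the three
    kinds of "ligature" to particular IDCs is my own assumption, and the
    resulting sequences are hypothetical descriptions, not encoded characters.

```python
import unicodedata

MEN = "\u9580"   # 門 "gate"
LONG = "\u9F8D"  # 龍 "dragon"

# Assumed correspondence between the three kinds of composition
# and three Ideographic Description Characters.
LAYOUTS = {
    "enclosure": "\u2FF5",      # surround from above
    "superposition": "\u2FFB",  # overlaid
    "juxtaposition": "\u2FF0",  # left to right
}


def describe(idc, first, second):
    """Build a binary IDS: the IDC followed by its two operands."""
    return idc + first + second


# Same two components, three different layouts, three different strings.
candidates = {kind: describe(idc, MEN, LONG) for kind, idc in LAYOUTS.items()}

for kind, ids in candidates.items():
    print(kind, ids, unicodedata.name(ids[0]))
```

    Note that the operand order is fixed by the layout (for the enclosing IDC,
    the outer component comes first), not by the reading order of the
    components.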

    And even in that case, the IDS encoding order is not necessarily the logical
    order, or the components have been changed from their original semantics,
    possibly by replacing one component with a simpler one (quite frequent
    in simplified Chinese and in many modern compositions for multisyllabic
    ideographs built from ideographs used and interpreted for their syllabic
    value, such as composite ideographs created for transliterations or
    personal names).

    In some compositions, the layout is not necessarily logical (it does not
    follow the default ordering implied by the IDS syntax) but is rearranged
    for practical or typographical reasons, or for readability. This has also
    occurred in some old Hangul compositions, before some new letters were
    created; similar reasons explain variations in the placement of some
    diacritics in alphabetic scripts, including Latin and Greek, and in some
    abjads such as Hebrew.

    For this reason, I do think that ZWJ is not very suitable for that work;
    TUS-IDS are better, but still insufficient. That may be one reason why the
    PRC insists on encoding ideographs without trying to decompose them: to
    make sure that the semantics are preserved and unambiguous for the common
    words or syllables. However, there still remains a problem with newly
    created ideographs that are polysyllabic in nature: they are real
    ligatures, but their layout is not always logical, and there is a conflict
    between the IDS syntax, which just describes the basic layout in a fixed
    reading and encoding order, and the semantic logical order:

    For example, if there are composed characters whose logical order is
    from bottom to top instead of top to bottom, the IDS will not describe this
    correctly. If this ever occurs, will there be variants of the vertical
    composition IDC? And if some strokes of one component are moved to another
    relative position or removed, how will you encode it? According to IDS you
    would break the semantics, as the initial non-composed ideograph would no
    longer be there.
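
    The fixed operand order can be illustrated with a small prefix-notation
    parser. This is a rough sketch under stated assumptions: it only knows the
    IDCs U+2FF0..U+2FFB (with U+2FF2 and U+2FF3 taking three operands), treats
    every other character as a component, and the function names parse_ids and
    leaves are hypothetical. The two sample sequences are hypothetical as well.

```python
# IDCs U+2FF2 and U+2FF3 take three operands; the rest of
# U+2FF0..U+2FFB take two (assumption: no later additions).
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple (idc, operand, ...)."""
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            pos += 1
            children = []
            for _ in range(arity):
                child, pos = parse(pos)
                children.append(child)
            return (ch, *children), pos
        return ch, pos + 1  # any other character is a leaf component

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree


def leaves(tree):
    """Return the components in encoding order (the order the IDS lists them)."""
    if isinstance(tree, tuple):
        out = []
        for child in tree[1:]:
            out.extend(leaves(child))
        return out
    return [tree]


ABOVE_BELOW = "\u2FF1"
top_first = ABOVE_BELOW + "\u9F8D" + "\u9580"     # 龍 above 門
bottom_first = ABOVE_BELOW + "\u9580" + "\u9F8D"  # 門 above 龍: another glyph

print(leaves(parse_ids(top_first)))
print(leaves(parse_ids(bottom_first)))
```

    With the above-to-below IDC, the first operand is always the upper
    component. Listing the lower component first in order to follow a
    bottom-to-top reading does not change the reading order: it describes a
    different glyph.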
