RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

From: Philippe Verdy (
Date: Tue Nov 25 2003 - 12:02:08 EST


    John Cowan writes:
    > > You are, because the floodgates, while once open, have been closed by
    > > normalization.
    > Indeed, they were opened in Unicode 1.1, as a result of the merger with
    > FDIS 10646; since then, only 46 characters with canonical decompositions
    > have been added to Unicode (excepting compatibility ideographs, which
    > are a special case).

    In fact, ISO10646 is meant to allow an easy one-to-one mapping between
    existing standard coded character sets (CCS) and unified code points.
    Accepting precomposed characters is then a necessity when precomposed
    characters exist in a legacy CCS standard. But they are included only for
    compatibility (exactly like compatibility ideographs).

    The Latin letters with two diacritics added in Latin Extension B do not
    seem to respect this constraint: they are not justified by the Vietnamese
    VISCII standard, which does not contain characters with two diacritics but
    instead composes them from two characters in its limited CCS. I don't know
    why ISO10646 would even have needed them, unless there is some Vietnamese
    DBCS standard that allows representing, in a 94x94 matrix, all letters
    with two diacritics as well as the Han ideographs used in Vietnamese. I
    looked in the IBM database of charsets (CCS+CES) and could not find any
    reference to such an EUC-style DBCS. So was it because there was an
    ongoing, unfinished DBCS standard for Vietnamese, working like GBK, SJIS
    or KSC 5601?
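
    To make the kind of character under discussion concrete, here is a minimal
    sketch, assuming Python's standard unicodedata module and taking U+1EA5 as
    an arbitrary example of a precomposed Latin letter with two diacritics. The
    helper name canonical_parts is hypothetical. It only shows that such a
    letter has a canonical decomposition into a base letter plus two combining
    marks, and that normalization recomposes it; it says nothing about which
    legacy CCS motivated its encoding.

```python
import unicodedata


def canonical_parts(ch):
    """Return the full canonical decomposition (NFD) of ch as U+XXXX strings.

    Hypothetical helper, for illustration only.
    """
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


# Example character chosen for illustration: a precomposed letter that
# carries two diacritics (circumflex and acute).
precomposed = "\u1EA5"

parts = canonical_parts(precomposed)
print(unicodedata.name(precomposed))
print(parts)

# Decomposing and then recomposing gives back the single precomposed code point.
recomposed = unicodedata.normalize("NFC", unicodedata.normalize("NFD", precomposed))
print(recomposed == precomposed)
```

    The same helper can be applied to any other code point to check whether it
    has a canonical decomposition: a character without one comes back as a
    single-element list.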


    This archive was generated by hypermail 2.1.5 : Tue Nov 25 2003 - 13:05:21 EST