RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 04 2003 - 22:22:38 EST

    > If some process using text is sensitive to the *interpretation* of
    > the text, i.e. it is concerned about the content and meaning of
    > the letters involved, then normalization, to forms NFC or NFD,
    > which only involve canonical equivalences, will *not* make a difference.
    > Or to be more subtle about it, it might make a difference, but it
    > is nonconformant to claim that a process which claims it does not
    > make a difference is nonconformant.
    >
    > If you can parse that last sentence, then you are well on the
    > way to understanding the Tao of Unicode.

    Still on the same subject, how do the old KS X standards for Hangul compare
    with each other? If they are upward compatible, and specify that converting
    an old text that uses no compound letters to the new standard does not
    mandate composing them into compound jamos (because the two spellings are
    considered equivalent there), then there is an issue when one text is
    converted from the old standard's set of jamos directly to Unicode while
    another is first converted to the new set and then to Unicode.

    In that case Unicode will not treat the two spellings as canonically
    equivalent, even though they would have been considered equivalent in the
    Korean standards. So we will find some data containing the compound jamos
    of the johab set, and other data spelling the same syllables without them.
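
    A quick check with Python's unicodedata module illustrates the point (the
    jamo sequences below are my own illustration, not taken from any actual
    KS X data):

        import unicodedata

        # "Modern" spelling of the syllable GGA with the compound jamo
        # SSANGKIYEOK, versus a legacy-style spelling with two single-letter
        # KIYEOK jamos.
        modern = "\u1101\u1161"        # SSANGKIYEOK + A
        legacy = "\u1100\u1100\u1161"  # KIYEOK + KIYEOK + A

        print(unicodedata.normalize("NFC", modern))   # U+AE4C, one johab syllable
        print(unicodedata.normalize("NFC", legacy))   # U+1100 U+AC00, leftover jamo
        print(unicodedata.normalize("NFC", modern) ==
              unicodedata.normalize("NFC", legacy))   # False

    NFC composes the compound jamo with the vowel into a single johab syllable,
    but it never merges the two single-letter KIYEOK jamos into SSANGKIYEOK, so
    the two spellings remain distinct after normalization.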

    The visual script itself is not altered, but the encoding is different and
    uses alternate decompositions that Unicode does not accept as canonically
    equivalent. So suppose you have some data coded only with single-letter
    jamos: there will be no way to unify it with modern data within Unicode.

    So unifying these strings will require rearranging the jamos. This is an
    issue for converters, and still an issue within Unicode, as single-letter
    jamos are not deprecated and are in fact necessary for modern Hangul (they
    are not "compatibility characters" and they participate in the composition
    of johab syllables for the determination of canonical equivalence).
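
    For instance (a small sketch in Python, taking its unicodedata module as a
    stand-in for any conformant normalizer), a johab syllable decomposes
    canonically into single-letter jamos and recomposes from them:

        import unicodedata

        # Single-letter (conjoining) jamos are canonical characters, not
        # compatibility characters: a johab syllable decomposes to them
        # under NFD and recomposes from them under NFC.
        syllable = "\uAC01"  # the syllable GAG
        jamos = unicodedata.normalize("NFD", syllable)
        print([hex(ord(c)) for c in jamos])                     # ['0x1100', '0x1161', '0x11a8']
        print(unicodedata.normalize("NFC", jamos) == syllable)  # True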

    If your compressor or transcoder is not allowed to perform any
    rearrangement of jamos for modern Hangul, that restriction should be
    relaxed for legacy data whose jamos should preferably have been precomposed
    before being converted to Unicode. Such data will persist for a long time,
    because it is so easy for a Korean writer to insert or delete a
    single-letter jamo when making corrections (or because of a missing
    keystroke in the input method used to compose the text initially).

    Now even if the text looks corrected, there will remain sequences that
    should have been stored with compound jamos. I can imagine a compressor or
    converter that preserves canonical equivalence only for Hangul text that
    already uses the compound jamos wherever they exist, and keeps it in NFC
    form. For other texts, the jamos would be recomposed as they should have
    been: canonical equivalence would not be preserved, but that would make it
    possible to compress the jamos correctly.
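
    As a rough sketch of that idea (in Python, with a deliberately tiny and
    hypothetical mapping table; a real converter would need the complete set of
    compound-jamo mappings), the recomposition step could look like this:

        import unicodedata

        # Illustrative pairs only -- this table is an assumption for the
        # sketch, not an exhaustive or authoritative list.
        COMPOUND_JAMO = {
            ("\u1100", "\u1100"): "\u1101",  # CHOSEONG KIYEOK + KIYEOK -> SSANGKIYEOK
            ("\u1169", "\u1161"): "\u116A",  # JUNGSEONG O + A          -> WA
            ("\u11A8", "\u11BA"): "\u11AA",  # JONGSEONG KIYEOK + SIOS  -> KIYEOK-SIOS
        }

        def compact_jamos(text: str) -> str:
            """Replace adjacent single-letter jamos with compound jamos,
            then apply NFC.  This changes the coded character sequence,
            so it is not a canonical transformation."""
            out = []
            i = 0
            while i < len(text):
                pair = (text[i], text[i + 1]) if i + 1 < len(text) else None
                if pair in COMPOUND_JAMO:
                    out.append(COMPOUND_JAMO[pair])
                    i += 2
                else:
                    out.append(text[i])
                    i += 1
            return unicodedata.normalize("NFC", "".join(out))

        # An old-style spelling that uses only single-letter jamos:
        legacy = "\u1100\u1100\u1169\u1161\u11A8\u11BA"
        print(compact_jamos(legacy))                  # one precomposed syllable
        print(unicodedata.normalize("NFC", legacy))   # plain NFC cannot unify it

    The compaction step is what breaks canonical equivalence: it rewrites the
    coded character sequence so that NFC can then pack each syllable into a
    single johab code point, which is exactly the trade-off described above.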





