RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2003 - 06:52:22 EST

    Doug Ewell writes:
    > I just read C10 again and noticed that it says that character sequences
    > can be replaced by canonical-equivalent sequences -- NOT that they have
    > to end up in a particular normalization form. So your strategy of
    > converting to a form halfway between NFC and NFD seems acceptable.
    > However, the sequences still have to be correct. You can't invent your
    > own equivalences, which is what I think you are doing by calling U+110B
    > a filler and then using it to create "VT syllables."

    That's not a problem with my interpretation of C10, which I understand: I'm
    not thinking about inserting any additional character. I simply
    misinterpreted the normative "empty" Unicode syllable name for IEUNG.

    I still think that we could try using only LV syllables, and no LVT
    syllables, to reduce the set of Hangul characters used, if this helps the
    final compressor. It's true that the LV syllables are not contiguous in the
    large Hangul johab syllable block. But this could reduce the number of codes
    needed in compression lookup dictionaries, and it would limit the number of
    table resets by exhausting the lookup table less often. It would also allow
    finding compressible similarities in the text stream at much shorter
    distances than in a text using a lot of LVT syllables. So the impact of
    the LV syllables being spread across the johab set would still be low.
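
    To put rough numbers on this, here is a small sketch in Python based on the
    standard Hangul syllable arithmetic (19 leading consonants, 21 vowels, 27
    trailing consonants, syllable block starting at U+AC00). The variable names
    are chosen here for illustration only.

```python
# Constants from the Hangul syllable composition arithmetic.
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"

lv_count = L_COUNT * V_COUNT                   # syllables without a trailing consonant
lvt_count = L_COUNT * V_COUNT * (T_COUNT - 1)  # syllables with a trailing consonant

# LV syllables are not contiguous: one occurs every T_COUNT code points.
lv_code_points = [S_BASE + i * T_COUNT for i in range(lv_count)]

# Hangul symbols a compressor sees if LVT syllables are split into LV + T.
reduced_alphabet = lv_count + (T_COUNT - 1)

print(lv_count, lvt_count, reduced_alphabet)
print([hex(cp) for cp in lv_code_points[:3]])
```

    Under these assumptions, the Hangul symbols are limited to the LV
    syllables plus the trailing jamos, instead of the full syllable block.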

    I will try again to compress Korean, using an NFC form modified to exclude
    LVT johab syllables, keeping only LV johab syllables and separate L, V, or T
    jamos...
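
    A minimal sketch of such a modified form, assuming Python and its standard
    unicodedata module. The function name nfc_without_lvt is made up for this
    illustration, and this is not necessarily the exact procedure used for the
    compression tests. Each LVT syllable is replaced by its LV syllable followed
    by the trailing jamo, which is a canonically equivalent sequence.

```python
import unicodedata

S_BASE = 0xAC00    # first precomposed Hangul syllable
T_BASE = 0x11A7    # one before the first trailing consonant jamo
T_COUNT = 28       # trailing consonants per LV pair, including "none"
S_COUNT = 11172    # total number of precomposed syllables


def nfc_without_lvt(text: str) -> str:
    """Normalize to NFC, then split every LVT syllable into LV + T."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        s_index = ord(ch) - S_BASE
        t_index = s_index % T_COUNT if 0 <= s_index < S_COUNT else 0
        if t_index:
            out.append(chr(ord(ch) - t_index))  # the LV syllable
            out.append(chr(T_BASE + t_index))   # the trailing jamo
        else:
            out.append(ch)                      # LV syllable or non-Hangul
    return "".join(out)


sample = "\ud55c\uae00"  # two LVT syllables
converted = nfc_without_lvt(sample)
# Normalizing again to NFC restores the original text.
print(unicodedata.normalize("NFC", converted) == sample)
```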

    I have another question about Korean: many jamos are in fact composed
    of other jamos; this is clearly visible both in their names and in their
    composed glyphs. What would be the linguistic impact of decomposing them (not
    canonically!)? Do Koreans really learn these jamos without breaking them into
    their components? I am thinking here of the SSANG (double) consonants, or the
    initial Y or final E of some vowels...
    Of course I won't be able to use such a decomposition in Unicode, but would it
    be possible to use it in some private encoding created with an m:n charset
    mapping from/to Unicode?
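
    As an illustration only, the sketch below builds one direction of such a
    private mapping for the modern SSANG leading consonants, deriving each
    component from the Unicode character name. The names ssang_table,
    PRIVATE_DECOMP and to_private are hypothetical, and the result is a private
    convention, not a Unicode equivalence. It is also one-way as written: a
    decoder could not tell a doubled jamo from two plain jamos in a row without
    some extra marker.

```python
import unicodedata


def ssang_table(first=0x1100, last=0x1112):
    """Map each modern SSANG leading consonant to two copies of its base jamo."""
    table = {}
    for cp in range(first, last + 1):
        # e.g. "HANGUL CHOSEONG SSANGKIYEOK" -> ("HANGUL CHOSEONG", "SSANGKIYEOK")
        prefix, _, short = unicodedata.name(chr(cp)).rpartition(" ")
        if short.startswith("SSANG"):
            base = unicodedata.lookup(f"{prefix} {short[len('SSANG'):]}")
            table[chr(cp)] = base * 2
    return table


PRIVATE_DECOMP = ssang_table()  # private, non-canonical mapping


def to_private(text: str) -> str:
    """Fully decompose (NFD), then apply the private jamo decomposition."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(PRIVATE_DECOMP.get(ch, ch) for ch in nfd)


# U+AE4C decomposes canonically to SSANGKIYEOK + A; the private step then
# splits SSANGKIYEOK into two KIYEOK jamos.
print([hex(ord(c)) for c in to_private("\uae4c")])
```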






    This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 07:42:33 EST