Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 03 2003 - 11:32:07 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > I still think that we could try to use only LV syllables, and not LVT
    > syllables, to reduce the set of Hangul characters used, if this helps
    > the final compressor.

    Aha, LV syllables. Now we are talking about something that exists and
    can be used in the manner you describe. It won't help SCSU or BOCU-1
    compression, but it might improve the performance of a Huffman or
    arithmetic implementation that can handle more than 256 characters, as
    you stated below.
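
    To put a rough number on the alphabet such a coder would face, here is
    a minimal Python sketch. The constants are the standard Hangul
    syllable arithmetic values; the names (is_lv, reduced_alphabet, and so
    on) are mine, chosen for illustration only, and the count ignores any
    L or V jamos that appear on their own in the text.

```python
# Hangul syllable arithmetic constants from the Unicode Standard.
S_BASE = 0xAC00                          # first precomposed syllable
T_BASE = 0x11A7                          # one below the first trailing jamo
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # all precomposed syllables

def is_lv(cp):
    """True if cp is a precomposed syllable with no trailing consonant."""
    s_index = cp - S_BASE
    return 0 <= s_index < S_COUNT and s_index % T_COUNT == 0

# LV syllables sit at every 28th code point of the syllable block.
lv_syllables = [cp for cp in range(S_BASE, S_BASE + S_COUNT) if is_lv(cp)]
# Trailing consonant jamos U+11A8..U+11C2.
trailing_jamos = list(range(T_BASE + 1, T_BASE + T_COUNT))

full_alphabet = S_COUNT                                     # LV and LVT syllables
reduced_alphabet = len(lv_syllables) + len(trailing_jamos)  # LV syllables plus T jamos
print(full_alphabet, reduced_alphabet)
```

    Under these assumptions the symbol set drops from 11172 possible
    syllables to 399 LV syllables plus 27 T jamos, at the cost of one
    extra symbol for every syllable that has a trailing consonant.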

    > It's true that the LV syllables are discontinuous in the large Hangul
    > johab syllable block. But it could reduce the number of codes needed
    > in compression lookup dictionaries and would limit the number of
    > table resets by exhausting the lookup table less often, and it would
    > also allow finding compressible similarities in the text stream at
    > much shorter distances than within a text using a lot of LVT
    > syllables. So the impact of the LV syllables being spread across the
    > johab set would still be low.

    True. Don't try it with SCSU, though, because you'd be constantly
    jumping between single-byte and Unicode mode (or using four bytes for
    every LVT syllable). And don't try it with BOCU-1, because every switch
    between the jamos block and the syllable block will cost three bytes.

    > I will try again to compress Korean using a modified NFC form that
    > excludes LVT johab syllables and keeps only LV johab syllables and
    > separate L, V, or T jamos...

    UAX #15 includes sample Java code showing, among other things, how to
    compose an LV syllable plus a T jamo into an LVT syllable. It would be
    relatively easy to reverse the logic, though of course the UAX does not
    show that because the result is in neither NF(K)C nor NF(K)D.
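
    A minimal sketch of that reversed logic, in Python rather than Java
    for brevity. This is not the UAX #15 code; split_lvt is a hypothetical
    name, and the only inputs are the standard Hangul arithmetic
    constants. An LVT syllable is replaced by its LV syllable followed by
    the T jamo, and everything else passes through untouched.

```python
import unicodedata

S_BASE, T_BASE = 0xAC00, 0x11A7   # first syllable; one below first trailing jamo
T_COUNT, S_COUNT = 28, 11172      # trailing slots per LV; total syllables

def split_lvt(text):
    """Replace each LVT syllable with its LV syllable plus a T jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if 0 <= s_index < S_COUNT:
            t_index = s_index % T_COUNT
            if t_index:  # nonzero means a trailing consonant is present
                out.append(chr(ord(ch) - t_index))   # the LV syllable
                out.append(chr(T_BASE + t_index))    # the T jamo
                continue
        out.append(ch)  # LV syllables and non-Hangul text are kept as-is
    return "".join(out)

sample = "\ud55c\uae00"   # U+D55C U+AE00, two LVT syllables
modified = split_lvt(sample)
# NFC recomposes LV + T, so the original text is recoverable.
restored = unicodedata.normalize("NFC", modified)
```

    Since NFC composes LV + T back into LVT, a decompressor only needs to
    normalize to NFC to undo the transformation, assuming the original
    text was in NFC to begin with.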

    Speaking of which, I just noticed that the function in SC UniPad to
    compose syllables from jamos does not handle this case (LV + T = LVT).
    I'll have to report that to the UniPad team.
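
    For reference, a sketch of a composer that handles both cases,
    L + V = LV and LV + T = LVT. This is my own Python illustration of
    the standard arithmetic, not UniPad's code or the UAX #15 sample, and
    compose_hangul is a hypothetical name.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT

def compose_hangul(text):
    """Compose L + V into LV, and LV + T into LVT, left to right."""
    out = []
    for ch in text:
        cp = ord(ch)
        if out:
            last = ord(out[-1])
            # Case 1: leading consonant followed by a vowel gives LV.
            l_index, v_index = last - L_BASE, cp - V_BASE
            if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
                out[-1] = chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
                continue
            # Case 2: LV syllable followed by a trailing consonant gives LVT.
            s_index, t_index = last - S_BASE, cp - T_BASE
            if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
                    and 0 < t_index < T_COUNT):
                out[-1] = chr(last + t_index)
                continue
        out.append(ch)
    return "".join(out)
```

    Case 2 is the one at issue here: the previous character is already a
    precomposed LV syllable rather than a jamo, so a composer that only
    looks for jamo sequences will miss it.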

    > I just have another question for Korean: many jamos are in fact
    > composed from other jamos; this is clearly visible both in their names
    > and in their composed glyphs. What would be the linguistic impact of
    > decomposing them (not canonically!)? Do Koreans really learn these
    > jamos without breaking them into their components? I am thinking here
    > of SSANG (double) consonants, or the initial Y or final E of some
    > vowels...

    This would be a good question for Jungshik or another native Korean. I
    have read that Korean children learn the syllables as whole units,
    rather than as an arrangement of jamos as I would see them, leading some
    to think of Hangul as a featural syllabary instead of an alphabet.

    > Of course I won't be able to use such decomposition in Unicode, but
    > would it be possible to use it in some private encoding created with an
    > m:n charset mapping from/to Unicode?

    You can do absolutely anything you like in a private encoding. Bernard
    Miller did:

    http://www.bytext.org/

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


