RE: Compression through normalization

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Dec 03 2003 - 06:27:39 EST

    Jungshik Shin writes:
    > > > I already answered about it: I had mixed the letters TLV instead of
    > > > LVT. All the above was correct if you swap the letters. So what I did
    > > > really was to compose only VT but not LV nor LVT:
    > > >
    > > > ( ((L* V* VT T*) - (L* V+ T)) | X )*
    > > >
    > > > I did it by using a leading filler (U+110B) to represent VT as an LVT
    > > > syllable...
    > >
    > > But U+110B isn't a filler, it's a real letter, IEUNG. If you want a
    > > choseong filler, you have to use U+115F. IEUNG is not equivalent to a
    > > filler and can't be used to construct a so-called "VT syllable." For
    > > example, (U+1100 + U+C544) is not equal to U+AC00.
    >
    > Doug is right. Philippe appears to have been confused by the fact that
    > phonetically U+110B IEUNG is 'null-consonant' (the place holder for
    > syllables that begin with a vowel). In Unicode-sense, however, U+110B is
    > not a filler but as genuine a letter as any other leading consonants are.

    Oops. I should have read that part more carefully. So my test was giving
    wrong results. (I knew it was not producing canonically equivalent strings,
    but I thought it was safe after looking at the list of Unicode names
    generated by the compressor, because I don't know that language...)
    I do need to reread chapter 11.4, which allows composing 19 leading
    consonant jamos with 21 medial vowel jamos (399 johab syllables), and
    optionally with 27 trailing consonant jamos (10773 more johab syllables),
    plus section 3.12 for the conforming conjoining behavior of jamos.
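
    As a rough check of those counts, here is a minimal Python sketch of the
    Hangul syllable composition arithmetic. The constants are the ones given in
    the Unicode Standard; the function name compose_syllable and the error
    messages are only illustrative assumptions, not part of any library.

```python
# Constants from the Hangul syllable composition arithmetic in the Unicode Standard.
S_BASE = 0xAC00   # first precomposed Hangul syllable
L_BASE = 0x1100   # first composable leading consonant (choseong)
V_BASE = 0x1161   # first composable vowel (jungseong)
T_BASE = 0x11A7   # one below the first composable trailing consonant (jongseong)
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT also counts "no trailing consonant"

LV_SYLLABLES = L_COUNT * V_COUNT                    # 399
LVT_SYLLABLES = L_COUNT * V_COUNT * (T_COUNT - 1)   # 10773


def compose_syllable(l, v, t=None):
    """Compose jamo code points (ints) into a precomposed syllable code point.

    Hypothetical helper for illustration: raises ValueError when a jamo is
    outside the composable ranges (for example the choseong filler U+115F).
    """
    l_index = l - L_BASE
    v_index = v - V_BASE
    if not 0 <= l_index < L_COUNT:
        raise ValueError(f"U+{l:04X} is not a composable leading consonant")
    if not 0 <= v_index < V_COUNT:
        raise ValueError(f"U+{v:04X} is not a composable vowel")
    t_index = 0
    if t is not None:
        t_index = t - T_BASE
        if not 1 <= t_index < T_COUNT:
            raise ValueError(f"U+{t:04X} is not a composable trailing consonant")
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index
```

    With this sketch, compose_syllable(0x110B, 0x1161) gives 0xC544 while
    compose_syllable(0x1100, 0x1161) gives 0xAC00, so IEUNG behaves like any
    other leading consonant, and passing U+115F as the leading jamo is rejected.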

    I knew that there was a choseong filler among the leading consonants and,
    rechecking it, you're right that it is not U+110B but U+115F. I wonder if
    there's a way to use it to encode a VT syllable separately from the leading
    consonant jamo that normally starts every modern Korean syllable. I fear
    not, because johab syllables can only start with a choseong in U+1100 to
    U+1112.
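
    This can be checked with a small Python sketch using the standard
    unicodedata module. The particular vowel and trailing consonant (A and
    final KIYEOK) are arbitrary choices for illustration, and the variable
    names are assumptions.

```python
import unicodedata

CHOSEONG_FILLER = "\u115F"   # HANGUL CHOSEONG FILLER
IEUNG = "\u110B"             # HANGUL CHOSEONG IEUNG, a real letter
VOWEL_A = "\u1161"           # HANGUL JUNGSEONG A
FINAL_KIYEOK = "\u11A8"      # HANGUL JONGSEONG KIYEOK

filler_seq = CHOSEONG_FILLER + VOWEL_A + FINAL_KIYEOK
ieung_seq = IEUNG + VOWEL_A + FINAL_KIYEOK

# NFC leaves the filler sequence as three separate jamos, because the filler
# is outside the composable leading-consonant range U+1100..U+1112.
nfc_filler = unicodedata.normalize("NFC", filler_seq)

# NFC composes the IEUNG sequence into a single precomposed LVT syllable.
nfc_ieung = unicodedata.normalize("NFC", ieung_seq)

# The two sequences are not canonically equivalent: their NFD forms differ.
canonically_equivalent = (
    unicodedata.normalize("NFD", filler_seq)
    == unicodedata.normalize("NFD", ieung_seq)
)

print([f"U+{ord(c):04X}" for c in nfc_filler])
print([f"U+{ord(c):04X}" for c in nfc_ieung])
print(canonically_equivalent)
```

    So substituting IEUNG for the filler changes the text rather than
    compressing it, which is the mistake discussed above.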

    That's a place where the code charts for Hangul jamos should show more
    precisely the 3 subsets of jamos usable for johab syllables, because I just
    looked at the normative names of Hangul syllables to check my compression
    attempt, and I did not see that I was in fact breaking the text by adding a
    visible IEUNG. (It "may" be phonetically acceptable only if the vowel
    encoded in the syllable is YE or YO or YU, but I'm not sure about it, and
    you're right that this would break the normal orthography of Korean words.)

    So until new VT "syllables" are encoded, with their canonical
    decompositions excluded from recomposition for the stability of
    normalization, I fear that it's impossible. This would require 21*27=567
    code points, and one cannot place them right after the existing Hangul
    syllables, which end at U+D7A3, because that would need a free area
    U+D7A4..U+D9DA, which overlaps the high surrogates starting at U+D800.
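
    The range arithmetic can be verified with a few lines of Python. This is
    only a sketch of the calculation; the variable names are assumptions, and
    the hypothetical VT block is not something Unicode defines.

```python
V_COUNT = 21          # medial vowel jamos usable in johab syllables
T_COUNT = 27          # trailing consonant jamos usable in johab syllables
LAST_SYLLABLE = 0xD7A3
HIGH_SURROGATE_START = 0xD800

vt_count = V_COUNT * T_COUNT                 # code points a VT block would need
first_vt = LAST_SYLLABLE + 1                 # first code point after the syllables
last_vt = first_vt + vt_count - 1            # last code point of the hypothetical block

# How many code points fit before the surrogates, and how many would collide.
free_before_surrogates = HIGH_SURROGATE_START - first_vt
colliding = max(0, last_vt - HIGH_SURROGATE_START + 1)

print(f"needed: {vt_count}")
print(f"range: U+{first_vt:04X}..U+{last_vt:04X}")
print(f"free before surrogates: {free_before_surrogates}, colliding: {colliding}")
```

    Only 92 code points are available between the last Hangul syllable and the
    first high surrogate, so most of such a block would fall inside the
    surrogate range.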

    Now I wonder what the exact role of the choseong filler U+115F is in the
    Hangul script, other than allowing (non-composable) VT syllables for foreign
    or old words that start with a vowel and that can only be written with
    separate jamos, without forming a ligature with a possible preceding leading
    consonant (terminating another word)...





