Re: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Dec 03 2003 - 13:24:17 EST

    On Wed, 3 Dec 2003, Doug Ewell wrote:

    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Speaking of which, I just noticed that the function in SC UniPad to
    > compose syllables from jamos does not handle this case (LV + T = LVT).
    > I'll have to report that to the UniPad team.

      Yudit, Mozilla and soon a whole bunch of applications written with Pango
    treat them as equivalent :-) BTW, Uniscribe doesn't treat them as
    equivalent, either. I 'failed to' persuade 'MS' that they should be
    treated as equivalent.
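
    For anyone who wants to check the LV + T = LVT case by hand, here is a
    minimal sketch in Python. The constants come from the Hangul syllable
    composition arithmetic in the Unicode Standard; the function name
    compose_lv_t is just an illustrative helper of mine, not anything from
    UniPad, Pango or Uniscribe.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition arithmetic.
S_BASE, T_BASE = 0xAC00, 0x11A7
T_COUNT, S_COUNT = 28, 11172


def compose_lv_t(lv: str, t: str) -> str:
    """Combine a precomposed LV syllable with a trailing consonant jamo.

    Hypothetical helper for illustration: returns the single LVT syllable
    when the inputs qualify, otherwise the unchanged two-character sequence.
    """
    s_index = ord(lv) - S_BASE
    t_index = ord(t) - T_BASE
    # An LV syllable is a precomposed syllable with no trailing consonant.
    is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
    # Index 0 means "no trailing consonant", so valid T indexes are 1..27.
    is_t = 0 < t_index < T_COUNT
    if not (is_lv and is_t):
        return lv + t
    return chr(ord(lv) + t_index)


lv, t = "\uAC00", "\u11A8"  # HANGUL SYLLABLE GA + HANGUL JONGSEONG KIYEOK
# NFC normalization performs the same composition, so the two should agree.
print(compose_lv_t(lv, t) == unicodedata.normalize("NFC", lv + t))
```

    An application that treats the two forms as equivalent could simply
    normalize both sides to NFC (or NFD) before comparing or rendering.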

    > > I just have another question for Korean: many jamos are in fact
    > > composed from other jamos: this is clearly visible both in their name
    > > and in their composed glyph. What would be the linguistic impact of
    > > decomposing them (not canonically!)? Do Koreans really learn these
    > > jamos without breaking them into their components? I think here about
    > > SSANG (double) consonants, or the initial Y or final E of some
    > > vowels...
    >
    > This would be a good question for Jungshik or another native Korean. I
    > have read that Korean children learn the syllables as whole units,

      There's no single view of what constitutes an 'atom'
    in the Korean writing system. You're right that on many occasions, to many
    people, syllables are units, but not unbreakable atomic units. Everybody
    knows that syllables are made out of consonants and vowels. The Korean
    ABC song enumerates only 14 consonants and 10 vowels, and that's what
    school children learn in the first grade. Most Korean input methods
    have a configurable option as to what the 'backspace/delete' key should
    do during syllable 'formation' (i.e. before the syllable is committed):
    delete the whole syllable or only the preceding letter. On Korean mobile
    phones, we go even smaller. Only three elements (a vertical stroke, a
    horizontal stroke and a dot, assigned '1', '2' and '3') are used to
    compose vowels. The 14 consonants are assigned '4' thru '0' (two or
    three of them to a key).
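
    The two backspace behaviours can be sketched roughly as follows. This
    is only an illustration under my own assumptions: the pre-edit text is
    held as precomposed (NFC) syllables, a 'letter' is taken to be one
    conjoining jamo (so a double consonant counts as a single letter), and
    the function name and flag are hypothetical, not taken from any actual
    input method.

```python
import unicodedata


def backspace(preedit: str, whole_syllable: bool) -> str:
    """Delete either the last syllable or only its last letter (jamo).

    Hypothetical sketch: real input methods keep their own composition
    state rather than round-tripping through normalization.
    """
    if not preedit:
        return preedit
    if whole_syllable:
        # Option 1: drop the whole last syllable.
        return preedit[:-1]
    # Option 2: split the last syllable into jamo, drop the final one,
    # and recompose whatever is left.
    jamo = unicodedata.normalize("NFD", preedit[-1])
    return preedit[:-1] + unicodedata.normalize("NFC", jamo[:-1])


text = "\uD55C\uAE00"  # the two syllables of 'Hangul'
print(backspace(text, whole_syllable=True))
print(backspace(text, whole_syllable=False))
```

    With the second option, repeated presses peel off the final consonant,
    then the vowel, then the leading consonant, one press at a time.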

      Before KS C 5601-1987, we used the n-byte Hangul code, in which we
    assigned a single-byte code to each of 19 leading consonants (14 basic
    + 5 double) + filler, 21 vowels (10 basic + 11 'diphthongs') + filler
    and 27 final consonants (14 basic + 13 complex). A syllable was
    represented with either 2 or 3 bytes. SI and SO were used to toggle
    between Korean mode and ASCII mode (that was before the invention of the
    EUC scheme in Unix, and we could use only octets with MSB=0).

      Also note that the 'JOHAB' encoding (which many Koreans regarded as
    better than 'Wansung' - KS X 1001-based EUC-KR - although it's not ISO
    2022 compliant) uses three 5-bit-long bit fields (for leading consonants,
    vowels and final consonants) to encode syllables. The 'MSB' of each
    2-byte unit is set to 1 to indicate that it's for Korean (not US-ASCII).
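
    The bit layout just described can be sketched like this. The helper
    names are mine, and I'm assuming the fields sit in the order leading
    consonant, vowel, final consonant from the high bits down. The actual
    index values in each field are not listed above, so the sketch simply
    reads them back from Python's built-in 'johab' codec.

```python
def pack_johab(initial: int, vowel: int, final: int) -> int:
    """Pack three 5-bit fields into a 16-bit unit with the MSB set."""
    for field in (initial, vowel, final):
        if not 0 <= field < 32:
            raise ValueError("each field must fit in 5 bits")
    return 0x8000 | (initial << 10) | (vowel << 5) | final


def unpack_johab(code: int):
    """Return (msb, initial, vowel, final) for a 16-bit unit."""
    return (code >> 15) & 1, (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


def johab_fields(syllable: str):
    """Encode one Hangul syllable with the 'johab' codec and split it."""
    raw = syllable.encode("johab")  # two bytes for a Hangul syllable
    return unpack_johab(int.from_bytes(raw, "big"))


# HANGUL SYLLABLE GA: prints the MSB and the three 5-bit field values.
print(johab_fields("\uAC00"))
```

    Because every jamo position has its own field, a syllable can be taken
    apart or put together with shifts and masks alone, without a lookup
    table of precomposed syllables.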

    > rather than as an arrangement of jamos as I would see them,

      In the early 20th century, a couple of 'script reform' attempts
    (by prominent Korean linguists) were made to work around the difficulty
    with printing (having to make a lot of metal types). It was proposed that
    letters be written in a linear fashion (just like Latin/Cyrillic/Greek
    alphabets are written), but none of them caught on. In the Korean
    community in the Russian Far East, quite a lot of material was
    published in one of these 'reformed' scripts.

    > leading some
    > to think of Hangul as a featural syllabary instead of an alphabet.

      Korean script is alphabetic, syllabic, and featural all at the same
    time :-) And it's also logographic, just as any other script can be
    (e.g. 'enough' in English is logographic in a sense, isn't it?)

      Jungshik


