RE: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Dec 03 2003 - 12:36:09 EST


    On Wed, 3 Dec 2003, Philippe Verdy wrote:

    > I just have another question for Korean: many jamos are in fact composed
    > from other jamos: this is clearly visible both in their name and in their
    > composed glyph. What would be the linguistic impact of decomposing them (not
    > canonically!)? Do Koreans really learn these jamos without breaking them into
    > their components? I think here about SSANG (double) consonants, or the

      The Korean alphabet, invented in 1443 and announced in 1446, included
    17 consonants and 11 vowels. Modern Korean uses 14 consonants and 10
    vowels (3 consonants and 1 vowel have become obsolete). The Korean
    'ABC song' enumerates only these (i.e. it doesn't include cluster/complex
    letters). The vowel 'U+119E ARAE A ᆞ' was used until the early 20th
    century, when it was 'officially' retired in the draft standard of Korean
    orthography by the Korean Linguistic Society in 1933 [1], which became
    the basis of both South and North Korean orthographic standards after
    the division of the country. See p. 6 of the PDF file (p. 2 in the
    actual document) of the scanned copy of the draft standard for the list
    of Korean letters along with their names (the upper left part of p. 6 in
    the PDF when rotated counterclockwise by 90 degrees). All other letters
    are composed out of these. A few additional consonants were used briefly
    to transcribe Chinese phonemes in phonetic textbooks in the 15th century,
    but have not been used otherwise.

      Kent and I have written, on several occasions, that complex Korean
    letters (Korean letter clusters) should have been made _canonically_
    equivalent to sequences of basic Korean letters. They were compatibility
    equivalent to each other in Unicode 2.0, but even that compatibility
    equivalence was removed instead of being upgraded to canonical
    equivalence. That's another mistake in the Korean encoding in Unicode.
    In the first place, complex Korean letters should not have been encoded,
    just as precomposed syllables should not have been. With NFC/NFD frozen
    forever, it is now impossible to rectify this.
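
    The lack of equivalence described above can be observed with a
    minimal Python sketch. It assumes only the standard unicodedata
    module; the constant and function names are made up for illustration
    and are not part of any standard.

```python
import unicodedata

SSANGKIYEOK = "\u1101"         # HANGUL CHOSEONG SSANGKIYEOK (complex letter)
KIYEOK_TWICE = "\u1100\u1100"  # HANGUL CHOSEONG KIYEOK, twice (basic letters)
A = "\u1161"                   # HANGUL JUNGSEONG A
KKA = "\uAE4C"                 # precomposed syllable KKA


def unchanged_by_normalization(text):
    """Return True if none of the four normalization forms alters text."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, text) == text for form in forms)


# The precomposed syllable decomposes canonically, but only as far as the
# complex letter; NFD never splits SSANGKIYEOK into two KIYEOKs.
kka_nfd = unicodedata.normalize("NFD", KKA)

# Going the other way, two basic letters plus a vowel do not compose to KKA.
recomposed = unicodedata.normalize("NFC", KIYEOK_TWICE + A)

print(unchanged_by_normalization(SSANGKIYEOK))  # True
print([f"U+{ord(c):04X}" for c in kka_nfd])     # ['U+1101', 'U+1161']
print(recomposed == KKA)                        # False
```

    Neither the canonical nor the compatibility forms relate U+1101 to
    U+1100 U+1100, so any such relation has to be supplied outside
    normalization.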

    > initial Y or final E of some vowels...
    > Of course I won't be able to use such decomposition in Unicode, but would it
    > be possible to use it in some private encoding created with a m:n charset
    > mapping from/to Unicode?

      That kind of composition/decomposition is necessary for linguistic
    analysis of Korean. Search engines (e.g. Google), rendering engines
    and incremental search also need it. See

      http://i18nl10n.com/korean/jamo.html
      (you need the UnBatang font, a GPL'd OpenType font for Korean
       available at http://i18nl10n.com/fonts/UnBatang.ttf, and Mozilla,
       either on Linux/Unix or on Windows. Uniscribe on XP
       can take advantage of Korean OpenType fonts, but only to a limited extent.
       In particular, it doesn't support the kind of equivalence I'm talking
       about here, so for Mozilla, even on Windows 2k/XP, I had to
       build a custom composition routine)
      http://i18nl10n.com/korean/jamocomp.html
      http://bugzilla.mozilla.org/show_bug.cgi?id=176315
      http://bugzilla.mozilla.org/show_bug.cgi?id=177877
      http://bugzilla.mozilla.org/show_bug.cgi?id=176290
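
    As a rough illustration of the kind of private, non-canonical
    decomposition discussed above, here is a minimal Python sketch. The
    table is a hypothetical and deliberately partial mapping chosen for
    the example; it is not taken from the pages or bugs listed above, and
    the function names are likewise made up. A search engine or an
    incremental search could compare strings after such a mapping.

```python
import unicodedata

# Hypothetical, partial table: complex jamo -> sequence of basic jamo.
# This is a private mapping, NOT a Unicode normalization form.
CLUSTER_TO_BASIC = {
    "\u1101": "\u1100\u1100",  # CHOSEONG SSANGKIYEOK -> KIYEOK KIYEOK
    "\u1104": "\u1103\u1103",  # CHOSEONG SSANGTIKEUT -> TIKEUT TIKEUT
    "\u1108": "\u1107\u1107",  # CHOSEONG SSANGPIEUP  -> PIEUP PIEUP
    "\u110A": "\u1109\u1109",  # CHOSEONG SSANGSIOS   -> SIOS SIOS
    "\u110D": "\u110C\u110C",  # CHOSEONG SSANGCIEUC  -> CIEUC CIEUC
    "\u1162": "\u1161\u1175",  # JUNGSEONG AE  -> A I
    "\u1164": "\u1163\u1175",  # JUNGSEONG YAE -> YA I
    "\u1166": "\u1165\u1175",  # JUNGSEONG E   -> EO I
    "\u1168": "\u1167\u1175",  # JUNGSEONG YE  -> YEO I
    "\u11AA": "\u11A8\u11BA",  # JONGSEONG KIYEOK-SIOS -> KIYEOK SIOS
}


def to_basic_jamo(text):
    """Apply NFD, then expand complex jamo using the private table."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(CLUSTER_TO_BASIC.get(ch, ch) for ch in decomposed)


def same_basic_letters(a, b):
    """Compare two strings after mapping both to basic jamo."""
    return to_basic_jamo(a) == to_basic_jamo(b)


# Precomposed KKA (U+AE4C) versus KIYEOK KIYEOK A typed as basic letters.
print(same_basic_letters("\uAE4C", "\u1100\u1100\u1161"))
```

    Running NFD first means precomposed syllables and conjoining jamo
    take the same path through the table. A real routine would need the
    full set of modern and archaic clusters, which this sketch omits.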

      Jungshik

    [1] http://i18nl10n.com/korean/orth1933.pdf


