RE: Still can't work out whats a "canonical decomp" vs a "compatibility decomp"

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri May 09 2003 - 08:33:11 EDT

    On Thu, 8 May 2003, Marco Cimarosti wrote:

    > Jarkko Hietaniemi wrote:
    > > Another potential Gedankenexperiment would of course be a
    > > Cleanencoding, but I guess the WCode is already quite
    > > good an attempt in that direction (though I must admit
    > > that the WTF encoding makes me grimace a bit :-)
    >
    > Here is Markus' Wcode, for the benefit of new list members:
    >
    > http://www.mindspring.com/~markus.scherer/unicode/wcode.html

      WCode, as it stands, is not 'clean' enough for me where Korean
    script is concerned.

         * WCode contains all Unicode characters except ones with a
           decomposition of any kind. Normalization on WCode only sorts
           combining characters in canonical order. (This removes some
           13000(?) characters from the BMP. WCode is mostly Unicode NFKD.)
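    The quoted rule ("except ones with a decomposition of any kind") can
    be approximated with Python's standard unicodedata module. This is only
    a sketch: the function names are made up for illustration, and it tests
    whether NFD or NFKD changes a character rather than reproducing WCode's
    actual definition.

```python
import unicodedata

def has_canonical_decomposition(ch: str) -> bool:
    """True if NFD changes the character (canonical decomposition only)."""
    return unicodedata.normalize("NFD", ch) != ch

def has_any_decomposition(ch: str) -> bool:
    """True if NFKD changes the character (canonical or compatibility)."""
    return unicodedata.normalize("NFKD", ch) != ch

# A, e-acute, the "fi" ligature, the first Hangul syllable, the first Jamo
samples = ["A", "\u00e9", "\ufb01", "\uac00", "\u1100"]
for ch in samples:
    print(f"U+{ord(ch):04X}",
          "canonical:", has_canonical_decomposition(ch),
          "any:", has_any_decomposition(ch))
```

    Under this approximation, U+FB01 has only a compatibility
    decomposition, U+AC00 has a canonical one, and U+1100 has none, so
    only the last would survive the quoted rule.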

    If I could start from scratch, I'd remove all the 'cluster Jamos' in
    the U+1100 block in addition to the precomposed Hangul syllables (which
    are removed by the above provision). That leaves us with 17 (+ 4) leading
    consonants, 11 medial vowels and 17 (+ 4) trailing consonants, along
    with the leading Jamo filler (U+115F) and the vowel filler (U+1160) [1],
    totalling 55 code points, down from over 12,000 code points for Korean
    script, and freeing up a huge amount of code space in the BMP for *much
    better* use. [2] This has the additional benefit of making SCSU/BCU
    better suited to Korean text represented in Jamos, because all Jamos
    can fall within a single sliding window of SCSU/BCU. It also simplifies
    collation/sorting.
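    The tally above can be checked with a few lines of Python. The counts
    are copied from the message rather than derived from the Unicode data
    files, the constant and function names are illustrative, and the
    128-code-point figure is the width of an SCSU dynamic window.

```python
# Counts as stated in the message (not re-derived from the Unicode data files).
LEADING_CONSONANTS = 17 + 4
MEDIAL_VOWELS = 11
TRAILING_CONSONANTS = 17 + 4
FILLERS = 2  # U+115F (leading Jamo filler) and U+1160 (vowel filler)

SCSU_WINDOW_SIZE = 128  # an SCSU dynamic window spans 128 code points

def proposed_repertoire_size() -> int:
    """Total number of code points in the reduced Jamo repertoire."""
    return LEADING_CONSONANTS + MEDIAL_VOWELS + TRAILING_CONSONANTS + FILLERS

total = proposed_repertoire_size()
print(total)                      # 55
print(total <= SCSU_WINDOW_SIZE)  # True: fits in one window if allocated contiguously
```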

    Jungshik

    [1] We can cut the number of code points further by encoding each
    consonant only once (and perhaps adding a trailing consonant filler).
    That gives 35 code points. In this scheme, a regular Korean syllable
    takes the form L+V+T+M?, where L, V, and T include fillers. Similar
    encodings were used in the mid-1980s on Korean Unix systems (before
    KS C 5601-1987, now KS X 1001:1998).
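    As a rough illustration of footnote [1], the sketch below redoes the
    count and checks the L+V+T+M? shape. It assumes each code point has
    already been classified into one of the letters L, V, T or M; that
    labelling, the function name, and the treatment of M as an optional
    trailing element (the message does not define it) are all assumptions.

```python
import re

# Counts from footnote [1]: consonants encoded once, plus three fillers.
CONSONANTS = 17 + 4
VOWELS = 11
FILLERS = 3  # leading, vowel, and the proposed trailing consonant filler
print(CONSONANTS + VOWELS + FILLERS)  # 35

# One letter per code point class; every syllable is L V T with an optional M.
SYLLABLE_RUN = re.compile(r"(?:LVTM?)+")

def is_well_formed(classes: str) -> bool:
    """True if the class string is a sequence of complete L+V+T+M? syllables."""
    return SYLLABLE_RUN.fullmatch(classes) is not None

print(is_well_formed("LVTLVTM"))  # True
print(is_well_formed("LV"))       # False: T (or its filler) is missing
```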

    [2] WCode already frees up 11,172 code points as it stands; my scheme
    gives us back about 180-210 more.
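    The 11,172 figure is the size of the precomposed Hangul syllable
    block (19 x 21 x 28). The sketch below uses the arithmetic
    decomposition constants from the Unicode Standard to map a syllable
    to its Jamos. The function name is illustrative, and
    unicodedata.normalize("NFD", ...) should give the same result.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT           # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT           # 11172 precomposed syllables

def decompose_syllable(ch: str) -> str:
    """Return the Jamo sequence for one precomposed syllable (else ch unchanged)."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT       # 0 means "no trailing consonant"
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail

print(S_COUNT)  # 11172
print([f"U+{ord(c):04X}" for c in decompose_syllable("\ud55c")])
# ['U+1112', 'U+1161', 'U+11AB']
```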


