Re: Error in Hangul composition code

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Tue Jul 06 2004 - 10:25:40 CDT


    > <http://www.unicode.org/reports/tr15/> says:
    >
    > int SIndex = last - SBase;
    ...

    The arithmetic decomposition of the Hangul Syllable
    characters can be described as follows:

    Each Hangul precomposed syllable character of
    Hangul_Syllable_Type LV has a canonical decomposition
    into an L and a V Hangul jamo:

    LV: s
    L in 1100–1112: LBase + ((s – SBase) div NCount)
    V in 1161–1175: VBase + (((s – SBase) mod NCount) div TCountP1)

    Each Hangul precomposed syllable character of
    Hangul_Syllable_Type LVT has a canonical decomposition
    into an LV Hangul syllable character and a T Hangul jamo:

    LVT: s
    LV: SBase + (((s – SBase) div TCountP1) * TCountP1)
    T in 11A8–11C2: TBaseM1 + ((s – SBase) mod TCountP1)

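    (TBaseM1 is TBase-1 and TCountP1 is TCount+1, taking TBase
    as the first T jamo, 11A8, and TCount as the number of T
    jamo, 27. So TBaseM1 = 11A7 and TCountP1 = 28, which are
    the values UAX #15 itself calls TBase and TCount.)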
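
    As an illustration only, the two rules above could be sketched
    in Java roughly as follows. The class and method names are made
    up for this example; the constants are the usual UAX #15 values,
    renamed to follow the TBaseM1/TCountP1 convention used here.

```java
final class HangulPairDecomposition {
    // Usual UAX #15 values; TBaseM1 and TCountP1 follow this message's naming.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161;
    static final int TBaseM1 = 0x11A7;            // one below the first T jamo (U+11A8)
    static final int LCount = 19, VCount = 21;
    static final int TCountP1 = 28;               // 27 T jamo plus the "no T" case
    static final int NCount = VCount * TCountP1;  // 588
    static final int SCount = LCount * NCount;    // 11172

    /**
     * One step of canonical decomposition (hypothetical helper):
     * LV -> {L, V}, LVT -> {LV, T}; null if s is not a precomposed syllable.
     */
    static char[] decomposeOnce(char s) {
        int sIndex = s - SBase;
        if (sIndex < 0 || sIndex >= SCount) {
            return null;
        }
        int tIndex = sIndex % TCountP1;
        if (tIndex == 0) {
            // Hangul_Syllable_Type LV: split into L and V.
            char l = (char) (LBase + sIndex / NCount);
            char v = (char) (VBase + (sIndex % NCount) / TCountP1);
            return new char[] { l, v };
        }
        // Hangul_Syllable_Type LVT: split into the LV syllable and T.
        char lv = (char) (SBase + (sIndex / TCountP1) * TCountP1);
        char t = (char) (TBaseM1 + tIndex);
        return new char[] { lv, t };
    }
}
```

    Applying decomposeOnce again to the LV part of an LVT result
    gives the full L, V, T sequence, one pair at a time.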

    This makes them decompose just like other canonical
    decompositions into (one or) two other characters;
    not more than two. The arithmetic description is then
    just a shorthand for a long list of 11,172 canonical
    decompositions (none of which is into more than two
    other characters). They could in principle be handled
    in normalisation code just like any other canonical
    decomposition/composition, given that expanded table.
    Code based on the arithmetic expressions is just more
    efficient in achieving the same thing.
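
    To make the "expanded table" idea concrete, here is a rough
    Java sketch that generates the pairwise table from the
    arithmetic. The class name, the packed-int key and the method
    names are assumptions made for this example, not anything
    taken from UAX #15.

```java
import java.util.HashMap;
import java.util.Map;

final class HangulExpandedTable {
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161;
    static final int TBaseM1 = 0x11A7;            // one below the first T jamo
    static final int LCount = 19, VCount = 21, TCountP1 = 28;
    static final int NCount = VCount * TCountP1;  // 588
    static final int SCount = LCount * NCount;    // 11172

    /** Packs an ordered pair of BMP characters into one int key (illustrative choice). */
    static int key(char first, char second) {
        return (first << 16) | second;
    }

    /** Maps every (L, V) and (LV, T) pair to the syllable it composes to. */
    static Map<Integer, Character> buildCompositionTable() {
        Map<Integer, Character> table = new HashMap<>();
        for (int sIndex = 0; sIndex < SCount; sIndex++) {
            char s = (char) (SBase + sIndex);
            int tIndex = sIndex % TCountP1;
            char first;
            char second;
            if (tIndex == 0) {
                // LV syllable: the pair is (L, V).
                first = (char) (LBase + sIndex / NCount);
                second = (char) (VBase + (sIndex % NCount) / TCountP1);
            } else {
                // LVT syllable: the pair is (LV, T).
                first = (char) (s - tIndex);
                second = (char) (TBaseM1 + tIndex);
            }
            table.put(key(first, second), s);
        }
        return table;
    }
}
```

    A generic table-driven composer could look pairs up in such a
    map exactly as it would for any other canonical composition.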

    The composition can likewise be described arithmetically.

    Note the use of the (relatively) new Hangul_Syllable_Type
    property.

    Some pseudo-code (for those who like code) based on this
    for composing Hangul Syllable characters (I will spare
    you the pseudo-code for decomposing; this reply is getting
    too long already):

        public static String composeHangul(String source)
        {
            int len = source.length();
            if (len == 0)
                return "";

            StringBuilder result = new StringBuilder();

            // Hangul is in the BMP, so we need not worry about higher planes.

            char prev = source.charAt(0); // get first char

            for (int i = 1; i < len; i++)
            {
                char curr = source.charAt(i);

                if ('\u1100' <= prev && prev <= '\u1112' && // "modern" L
                    '\u1161' <= curr && curr <= '\u1175') // "modern" V
                {
                    // make a syllable of the form LV
                    prev = (char)(SBase + ((prev - LBase) * NCount) +
                                          ((curr - VBase) * TCountP1));
                }
                else if (hangulSyllableType(prev) == HangulSyllableType.LV &&
                         '\u11A8' <= curr && curr <= '\u11C2') // "modern" T
                {
                    // make a syllable of the form LVT
                    prev += curr - TBaseM1;
                }
                else
                {
                    // no arithmetic composition possible, move on
                    result.append(prev);
                    prev = curr;
                }
            }
            result.append(prev); // don't lose last char in string
            return result.toString();
        }
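
    The pseudo-code relies on constants and on a
    hangulSyllableType helper that are not spelled out in this
    message. One possible way to fill them in is sketched below.
    This is an assumption-laden simplification: the enum only
    tells LV and LVT apart and lumps every other value of the
    real Hangul_Syllable_Type property (L, V, T, NA) into OTHER,
    and the class name is invented for the example.

```java
/** Simplified stand-in for Hangul_Syllable_Type (assumption: only LV and LVT matter here). */
enum HangulSyllableType { LV, LVT, OTHER }

final class HangulSupport {
    // Usual UAX #15 values, named as in this message.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161;
    static final int TBaseM1 = 0x11A7;            // one below the first T jamo (U+11A8)
    static final int LCount = 19, VCount = 21;
    static final int TCountP1 = 28;               // 27 T jamo plus the "no T" case
    static final int NCount = VCount * TCountP1;  // 588
    static final int SCount = LCount * NCount;    // 11172

    /** Classifies a precomposed syllable arithmetically instead of by property lookup. */
    static HangulSyllableType hangulSyllableType(char c) {
        int sIndex = c - SBase;
        if (sIndex < 0 || sIndex >= SCount) {
            return HangulSyllableType.OTHER;      // not a precomposed Hangul syllable
        }
        // Every TCountP1-th syllable has no trailing consonant, i.e. is of type LV.
        return (sIndex % TCountP1 == 0) ? HangulSyllableType.LV
                                        : HangulSyllableType.LVT;
    }
}
```

    A real implementation would more likely read the property
    from the Unicode Character Database than recompute it.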

    Note that, while NOT part of Unicode decompositions, many
    of the Hangul Jamo characters decompose into two or three
    other Hangul Jamo letters. But that is much beyond UAX 15,
    unfortunately.

          /kent k


