UAX 15 hangul composition

From: Theo Veenker (Theo.Veenker@let.uu.nl)
Date: Tue Aug 03 2004 - 06:47:36 CDT


    Don't know if this has been asked/reported before, but is the example code
    for hangul composition in UAX 15 correct?

    The code is:
         public static String composeHangul(String source) {
             int len = source.length();
             if (len == 0) return "";
             StringBuffer result = new StringBuffer();
             char last = source.charAt(0); // copy first char
             result.append(last);

             for (int i = 1; i < len; ++i) {
                 char ch = source.charAt(i);

                 // 1. check to see if two current characters are L and V

                 int LIndex = last - LBase;
                 if (0 <= LIndex && LIndex < LCount) {
                     int VIndex = ch - VBase;
                     if (0 <= VIndex && VIndex < VCount) {

                         // make syllable of form LV

                         last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                         result.setCharAt(result.length()-1, last); // reset last
                         continue; // discard ch
                     }
                 }

                 // 2. check to see if two current characters are LV and T

                 int SIndex = last - SBase;
                 if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
                     int TIndex = ch - TBase;
                     if (0 <= TIndex && TIndex <= TCount) {

                         // make syllable of form LVT

                         last += TIndex;
                         result.setCharAt(result.length()-1, last); // reset last
                         continue; // discard ch
                     }
                 }

                 // if neither case was true, just add the character

                 last = ch;
                 result.append(ch);
             }
             return result.toString();
         }

    Suppose I feed it 0xAC00 0x11C3. Since 0xAC00 is an LV syllable,
    step 2 applies:

    SIndex = 0xAC00 - 0xAC00 = 0
    TIndex = 0x11C3 - 0x11A7 = 28

    That makes the test "(0 <= TIndex && TIndex <= TCount)" true, so the
    output is 0xAC00 + 28 = 0xAC1C, which is not an LVT but an LV syllable!

    I think the TIndex <= TCount should be TIndex < TCount. IMO the
    example would be clearer if it used the Hangul_Syllable_Type
    property.
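
    For illustration, here is a minimal sketch of the quoted routine with
    the upper bound changed to TIndex < TCount, as proposed above. The
    class name HangulComposeSketch and the main method are made up for
    this example; the constants are the ones UAX 15 defines. The sketch
    also tightens the lower bound to 0 < TIndex, which goes beyond what
    the message proposes: it is an assumption based on TBase (0x11A7)
    being one below the first trailing consonant (0x11A8), so a TIndex
    of 0 would otherwise silently drop 0x11A7 from the output.

```java
public class HangulComposeSketch {
    // Constants as defined in UAX 15 for Hangul syllable arithmetic.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int LCount = 19, VCount = 21, TCount = 28;
    static final int NCount = VCount * TCount; // 588
    static final int SCount = LCount * NCount; // 11172

    public static String composeHangul(String source) {
        int len = source.length();
        if (len == 0) return "";
        StringBuilder result = new StringBuilder();
        char last = source.charAt(0); // copy first char
        result.append(last);

        for (int i = 1; i < len; ++i) {
            char ch = source.charAt(i);

            // 1. check whether the two current characters are L and V
            int LIndex = last - LBase;
            if (0 <= LIndex && LIndex < LCount) {
                int VIndex = ch - VBase;
                if (0 <= VIndex && VIndex < VCount) {
                    // make syllable of form LV
                    last = (char) (SBase + (LIndex * VCount + VIndex) * TCount);
                    result.setCharAt(result.length() - 1, last);
                    continue; // discard ch
                }
            }

            // 2. check whether the two current characters are LV and T
            int SIndex = last - SBase;
            if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
                int TIndex = ch - TBase;
                // Upper bound: TIndex < TCount (the fix proposed in the message).
                // Lower bound: 0 < TIndex (assumption added in this sketch).
                if (0 < TIndex && TIndex < TCount) {
                    // make syllable of form LVT
                    last += TIndex;
                    result.setCharAt(result.length() - 1, last);
                    continue; // discard ch
                }
            }

            // if neither case was true, just add the character
            last = ch;
            result.append(ch);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Print each result as a sequence of hex code units.
        String[] inputs = { "\uAC00\u11C3", "\u1100\u1161\u11A8" };
        for (String in : inputs) {
            StringBuilder hex = new StringBuilder();
            for (char c : composeHangul(in).toCharArray()) {
                hex.append(String.format("%04X ", (int) c));
            }
            System.out.println(hex.toString().trim());
        }
    }
}
```

    With the strict bounds, 0xAC00 0x11C3 is left as two characters
    instead of being merged into 0xAC1C, while a sequence such as
    0x1100 0x1161 0x11A8 still composes to the single LVT syllable
    0xAC01.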

    A somewhat related question. I know next to nothing about Hangul
    [de]composition, so forgive me for asking silly questions. The
    UnicodeData.txt file contains many more than the 19 L, 21 V, and
    28 T jamos. Are the other jamos not used to compose syllables, or
    does the syllable block represent an incomplete set of compatibility
    characters? Which is it?
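
    As a rough illustration of the ranges involved, the sketch below
    classifies a code point by whether the arithmetic in the quoted code
    can use it. The class JamoRangeSketch and its methods are
    hypothetical names for this example only; the constants are the UAX 15
    ones. Note that TCount = 28 includes the "no trailing consonant"
    case, so only 27 code points (0x11A8 to 0x11C2) count as T here. The
    sketch shows only which jamos the arithmetic covers; it does not
    settle the status of the jamos outside those ranges.

```java
public class JamoRangeSketch {
    // Constants as defined in UAX 15.
    static final int LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    static final int LCount = 19, VCount = 21, TCount = 28;

    /** Returns "L", "V", or "T" for jamos the arithmetic covers, else "other". */
    static String composableKind(int cp) {
        if (cp >= LBase && cp < LBase + LCount) return "L"; // 0x1100..0x1112
        if (cp >= VBase && cp < VBase + VCount) return "V"; // 0x1161..0x1175
        if (cp > TBase && cp < TBase + TCount) return "T";  // 0x11A8..0x11C2
        return "other";
    }

    /** Number of syllables reachable by the arithmetic: 19 * 21 * 28. */
    static int syllableCount() {
        return LCount * VCount * TCount;
    }

    public static void main(String[] args) {
        int[] samples = { 0x1100, 0x1113, 0x1161, 0x11A7, 0x11A8, 0x11C3 };
        for (int cp : samples) {
            System.out.println(String.format("%04X", cp) + " " + composableKind(cp));
        }
        System.out.println(syllableCount());
    }
}
```

    Under these constants, 0x1113 and 0x11C3 both fall outside the
    composable ranges, and the syllable count is 19 * 21 * 28 = 11172.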

    Theo


