Re: Error in Hangul composition code

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Tue Jul 06 2004 - 10:25:40 CDT

In reply to: Marcin 'Qrczak' Kowalczyk: "Error in Hangul composition code"

> <http://www.unicode.org/reports/tr15/> says:
>
> int SIndex = last - SBase;
...

The arithmetic decomposition of the Hangul Syllable
characters can be described as follows:

Each Hangul precomposed syllable character of
Hangul_Syllable_Type LV has a canonical decomposition
into an L and a V Hangul jamo:

LV: s
L in 1100–1112: LBase + ((s – SBase) div NCount)
V in 1161–1175: VBase + (((s – SBase) mod NCount) div TCountP1)

Each Hangul precomposed syllable character of
Hangul_Syllable_Type LVT has a canonical decomposition
into an LV Hangul syllable character and a T Hangul jamo:

LVT: s
LV: SBase + (((s – SBase) div TCountP1) * TCountP1)
T in 11A8–11C2: TBaseM1 + ((s – SBase) mod TCountP1)

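(TBaseM1 is TBase-1, and TCountP1 is TCount+1; here TBase
is 11A8 and TCount is 27, so TBaseM1 = 11A7, TCountP1 = 28,
and NCount = VCount * TCountP1 = 588.)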
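
The two rules above can be sketched in Java roughly as follows. This is only an illustration, not code from UAX 15: the class and method names are made up for this sketch, and the constant values are the usual ones (SBase = AC00, LBase = 1100, VBase = 1161) with TBaseM1 = 11A7 and TCountP1 = 28 as assumed above.

```java
final class HangulDecomposeSketch {
    // Constant values are assumptions taken from the usual Hangul arithmetic;
    // TBaseM1 and TCountP1 follow the naming used in this message.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161;
    static final int TBaseM1 = 0x11A7;            // TBase - 1
    static final int TCountP1 = 28;               // TCount + 1
    static final int NCount = 21 * TCountP1;      // VCount * TCountP1 = 588
    static final int SCount = 19 * NCount;        // LCount * NCount = 11172

    /** One canonical step: LV becomes L V, LVT becomes LV T, others are unchanged. */
    static String decomposeOnce(char s) {
        int sIndex = s - SBase;
        if (sIndex < 0 || sIndex >= SCount) {
            return String.valueOf(s);             // not a precomposed syllable
        }
        int tIndex = sIndex % TCountP1;
        if (tIndex == 0) {                        // Hangul_Syllable_Type LV
            char l = (char) (LBase + sIndex / NCount);
            char v = (char) (VBase + (sIndex % NCount) / TCountP1);
            return "" + l + v;
        }
        char lv = (char) (s - tIndex);            // drop the trailing consonant
        char t = (char) (TBaseM1 + tIndex);
        return "" + lv + t;                       // Hangul_Syllable_Type LVT
    }

    /** Applies the step until only jamo remain (at most two steps are needed). */
    static String decomposeFully(char s) {
        String step = decomposeOnce(s);
        if (step.length() == 1) {
            return step;
        }
        // Only the first character of a pair can decompose further (LV to L V).
        return decomposeOnce(step.charAt(0)) + step.charAt(1);
    }
}
```

For example, decomposeOnce applied to U+AC1D gives U+AC1C followed by U+11A8, and decomposeFully applied to U+D55C gives U+1112 U+1161 U+11AB.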

This makes them decompose just like other canonical
decompositions into (one or) two other characters;
not more than two. The arithmetic description is then
just a shorthand for a long list of 11,172 canonical
decompositions (which can't be into more than two
other characters). They could in principle be handled
in normalisation code just like any other canonical
decomposition/composition, given that expanded table.
Code based on the arithmetic expressions is just more
efficient in achieving the same thing.
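
To make the "expanded table" concrete, here is a rough Java sketch that generates it from the arithmetic. The class and method names are hypothetical, and the constant values are the same assumed ones as elsewhere in this message (TBaseM1 = 11A7, TCountP1 = 28).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HangulTableSketch {
    // Assumed constant values; names follow this message.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBaseM1 = 0x11A7;
    static final int TCountP1 = 28, NCount = 21 * TCountP1, SCount = 19 * NCount;

    /** Maps every precomposed syllable to its two-character canonical decomposition. */
    static Map<Character, String> buildTable() {
        Map<Character, String> table = new LinkedHashMap<>();
        for (int sIndex = 0; sIndex < SCount; sIndex++) {
            char s = (char) (SBase + sIndex);
            int tIndex = sIndex % TCountP1;
            String pair;
            if (tIndex == 0) {
                // LV syllable: leading consonant followed by vowel
                pair = "" + (char) (LBase + sIndex / NCount)
                          + (char) (VBase + (sIndex % NCount) / TCountP1);
            } else {
                // LVT syllable: LV syllable followed by trailing consonant
                pair = "" + (char) (s - tIndex) + (char) (TBaseM1 + tIndex);
            }
            table.put(s, pair);
        }
        return table;
    }
}
```

The resulting map has 11,172 entries, 399 of them for LV syllables and 10,773 for LVT syllables, and every value is exactly two characters long.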

The composition can likewise be described arithmetically.

Note the use of the (relatively) new Hangul_Syllable_Type
property.
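
For the precomposed syllables, the LV/LVT distinction of that property can itself be computed arithmetically. A minimal Java sketch follows; the names are invented for illustration, and it is simplified in that the real property also has the values L, V, T and NA, which are lumped together as OTHER here.

```java
final class HangulSyllableTypeSketch {
    // Assumed constant values; names follow this message.
    static final int SBase = 0xAC00;
    static final int TCountP1 = 28;                 // TCount + 1
    static final int SCount = 19 * 21 * TCountP1;   // 11172 precomposed syllables

    // Simplified: the real Hangul_Syllable_Type also has L, V, T and NA.
    enum Type { LV, LVT, OTHER }

    static Type hangulSyllableType(char c) {
        int sIndex = c - SBase;
        if (sIndex < 0 || sIndex >= SCount) {
            return Type.OTHER;                      // not a precomposed syllable
        }
        // A syllable without a trailing consonant sits at a multiple of TCountP1.
        return (sIndex % TCountP1 == 0) ? Type.LV : Type.LVT;
    }
}
```

A table-driven lookup from the Unicode Character Database would give the same answers for these characters; the arithmetic form just avoids the table.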

Some pseudo-code (for those who like code) based on this
for composing Hangul Syllable characters (I will spare
you the pseudo-code for decomposing; this reply is getting
too long already):

    public static String composeHangul(String source)
    {
        int len = source.length();
        if (len == 0)
            return "";

        StringBuffer result = new StringBuffer();

        // Hangul is in the BMP, so we need not worry about higher planes.

        char prev = source.charAt(0); // get first char

        for (int i = 1; i < len; i++)
        {
            char curr = source.charAt(i);

            if ('\u1100' <= prev && prev <= '\u1112' && // "modern" L
                '\u1161' <= curr && curr <= '\u1175') // "modern" V
            {
                // make a syllable of the form LV
                prev = (char)(SBase + ((prev - LBase) * NCount) +
                                      ((curr - VBase) * TCountP1));
            }
            else if (hangulSyllableType(prev) == HangulSyllableType.LV &&
                     '\u11A8' <= curr && curr <= '\u11C2') // "modern" T
            {
                // make a syllable of the form LVT
                prev += curr - TBaseM1;
            }
            else
            {
                // no arithmetic composition possible, move on
                result.append(prev);
                prev = curr;
            }
        }
        result.append(prev); // don't lose last char in string
        return result.toString();
    }
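
The pseudo-code leaves the constants and the hangulSyllableType test undefined. One way to fill them in so that it compiles and runs is sketched below. The constant values are the assumed ones (TBaseM1 = 11A7, TCountP1 = 28), isLV is a hypothetical stand-in for the Hangul_Syllable_Type test, and StringBuilder is swapped in for StringBuffer. Like the pseudo-code, it only looks at adjacent characters.

```java
final class HangulComposeSketch {
    // Assumed constant values; names follow this message.
    static final int SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBaseM1 = 0x11A7;
    static final int TCountP1 = 28, NCount = 21 * TCountP1, SCount = 19 * NCount;

    // Hypothetical stand-in for hangulSyllableType(c) == LV.
    static boolean isLV(char c) {
        int sIndex = c - SBase;
        return 0 <= sIndex && sIndex < SCount && sIndex % TCountP1 == 0;
    }

    static String composeHangul(String source) {
        if (source.isEmpty()) {
            return "";
        }
        StringBuilder result = new StringBuilder();
        char prev = source.charAt(0);
        for (int i = 1; i < source.length(); i++) {
            char curr = source.charAt(i);
            if (0x1100 <= prev && prev <= 0x1112 && 0x1161 <= curr && curr <= 0x1175) {
                // "modern" L followed by "modern" V: make an LV syllable
                prev = (char) (SBase + (prev - LBase) * NCount + (curr - VBase) * TCountP1);
            } else if (isLV(prev) && 0x11A8 <= curr && curr <= 0x11C2) {
                // LV syllable followed by "modern" T: make an LVT syllable
                prev += curr - TBaseM1;
            } else {
                // no arithmetic composition possible, move on
                result.append(prev);
                prev = curr;
            }
        }
        result.append(prev);
        return result.toString();
    }

    public static void main(String[] args) {
        // U+1112 U+1161 U+11AB should compose to the single syllable U+D55C.
        System.out.println(composeHangul("\u1112\u1161\u11AB").equals("\uD55C"));
    }
}
```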

Note that, while this is NOT part of the Unicode decompositions,
many of the Hangul Jamo characters decompose into two or three
other Hangul Jamo letters. But that is well beyond the scope of
UAX 15, unfortunately.

/kent k

