L2/06-082

Date/Time: Mon Mar 6 22:19:06 CST 2006
Contact: jjc@sipa.or.th
Name: James Clark
Report Type: Error Report
Opt Subject: Thai collation errors

1. http://www.unicode.org/reports/tr10/#Modifications lists the following amongst the modifications made to DUCET in 4.1.0:

"After the last secondary ignorable
Insertion of the character U+0E2F ฯ THAI CHARACTER PAIYANNOI
Then the character U+0E46 ๆ THAI CHARACTER MAIYAMOK
Then the character U+0E4F ๏ THAI CHARACTER FONGMAN
Then the character U+0E5A ๚ THAI CHARACTER ANGKHANKHU
Then the character U+0E5B ๛ THAI CHARACTER KHOMUT"

However, as far as I can see there has been no change in DUCET between 4.0.0 and 4.1.0 as regards those 5 characters.

The description of the change is also rather strange. It talks about inserting the characters "after the last secondary ignorable", but DUCET doesn't have any secondary ignorables.

In both 4.0.0 and 4.1.0, FONGMAN, ANGKHANKHU and KHOMUT are treated as variable collation elements. This seems correct to me from a Thai perspective, and also consistent with the handling of similar characters in other scripts, such as Khmer.

The handling of PAIYANNOI and MAIYAMOK, on the other hand, is not correct in either 4.0.0 or 4.1.0. These should both be treated similarly to punctuation characters such as ANGKHANKHU. The obvious fix is to insert them as variable collation elements before FONGMAN.

2. Perhaps a similar fix should also be made to LAO KO LA (U+0EC6).

3. http://www.unicode.org/reports/tr10/#Modifications also lists

"After U+0E24 ฤ THAI CHARACTER RU
Insertion of the sequence: U+0E24 ฤ THAI CHARACTER RU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO
After U+0E26 ฦ THAI CHARACTER LU
Insertion of the sequence: U+0E26 ฦ THAI CHARACTER LU + U+0E45 ๅ THAI CHARACTER LAKKHANGYAO"

However, these changes have not been made. I also don't believe such a change is necessary. In well-formed Thai, RU and LU cannot combine with any vowels.
Thus making LAKKHANGYAO sort like a vowel (as was done in 4.1.0) is sufficient to ensure that strings including RU and LU sort correctly.

4. The following contractions:

0E40 0E24 ; [.1B0B.0020.0002.0E24][.1B22.0020.001F.0E40] #
0E41 0E24 ; [.1B0B.0020.0002.0E24][.1B23.0020.001F.0E41] #
0E42 0E24 ; [.1B0B.0020.0002.0E24][.1B24.0020.001F.0E42] #
0E43 0E24 ; [.1B0B.0020.0002.0E24][.1B25.0020.001F.0E43] #
0E44 0E24 ; [.1B0B.0020.0002.0E24][.1B26.0020.001F.0E44] #
0E40 0E26 ; [.1B0D.0020.0002.0E26][.1B22.0020.001F.0E40] #
0E41 0E26 ; [.1B0D.0020.0002.0E26][.1B23.0020.001F.0E41] #
0E42 0E26 ; [.1B0D.0020.0002.0E26][.1B24.0020.001F.0E42] #
0E43 0E26 ; [.1B0D.0020.0002.0E26][.1B25.0020.001F.0E43] #
0E44 0E26 ; [.1B0D.0020.0002.0E26][.1B26.0020.001F.0E44] #

are unnecessary because in well-formed Thai, RU and LU cannot occur in a syllable with a vowel (because they already include a vowel sound).

5. Currently you have PHINTHU sorting like a vowel, but it ought to be sorted more like a tone mark. I would suggest inserting it between YAMAKKAN and MAITAIKHU.

I refer you to the following paper (in Thai) for a comprehensive discussion of Thai sorting:

http://www.nectec.or.th/NTJ/pdf/NTJ_sep_oct_1999.pdf

There are some other differences between what that paper recommends and what you currently have, but I think they are more arguable, and should perhaps be handled by Thai-specific tailoring. Specifically, it recommends that

- THANTHAKAT sorts before the tone marks, not after
- NIKHAHIT sorts between consonants and vowels
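
For readers who want to inspect the contractions listed in item 4, the following is a minimal Python sketch that parses those ten entries and prints their primary weights. The names ENTRIES and parse_entry are hypothetical helpers introduced only for this illustration; they are not part of DUCET or of any library. The sketch assumes the four-field element layout shown in the entries above (primary, secondary, tertiary, then a fourth field).

```python
import re

# The ten contraction entries quoted in item 4, copied as listed.
ENTRIES = """\
0E40 0E24 ; [.1B0B.0020.0002.0E24][.1B22.0020.001F.0E40] #
0E41 0E24 ; [.1B0B.0020.0002.0E24][.1B23.0020.001F.0E41] #
0E42 0E24 ; [.1B0B.0020.0002.0E24][.1B24.0020.001F.0E42] #
0E43 0E24 ; [.1B0B.0020.0002.0E24][.1B25.0020.001F.0E43] #
0E44 0E24 ; [.1B0B.0020.0002.0E24][.1B26.0020.001F.0E44] #
0E40 0E26 ; [.1B0D.0020.0002.0E26][.1B22.0020.001F.0E40] #
0E41 0E26 ; [.1B0D.0020.0002.0E26][.1B23.0020.001F.0E41] #
0E42 0E26 ; [.1B0D.0020.0002.0E26][.1B24.0020.001F.0E42] #
0E43 0E26 ; [.1B0D.0020.0002.0E26][.1B25.0020.001F.0E43] #
0E44 0E26 ; [.1B0D.0020.0002.0E26][.1B26.0020.001F.0E44] #
"""

# One collation element: a '.' or '*' marker ('*' marks a variable element),
# then primary, secondary and tertiary weights and a fourth field.
ELEMENT = re.compile(
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)


def parse_entry(line):
    """Return (code_points, elements) for one entry line.

    Each element is a tuple (is_variable, primary, secondary, tertiary).
    """
    body = line.split("#", 1)[0]          # drop the trailing comment
    chars, _, weights = body.partition(";")
    code_points = [int(cp, 16) for cp in chars.split()]
    elements = [
        (m.group(1) == "*", int(m.group(2), 16), int(m.group(3), 16), int(m.group(4), 16))
        for m in ELEMENT.finditer(weights)
    ]
    return code_points, elements


for entry in ENTRIES.splitlines():
    code_points, elements = parse_entry(entry)
    chars = " ".join(f"{cp:04X}" for cp in code_points)
    primaries = " ".join(f"{primary:04X}" for _, primary, _, _ in elements)
    print(f"{chars} -> primaries {primaries}")
```

In every entry the first element carries the primary weight of the consonant (RU or LU) and the second carries that of the preceding vowel, so each contraction simply swaps the two characters into consonant-then-vowel order for weighting.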