L2/05-303

Title: Inconsistencies in the set of Linebreak=SA characters
Source: Mark Davis
Date: October 14, 2005

There are some inconsistencies in the set of Linebreak=SA characters that came up on the UTC mailing list. Ken found some inconsistencies, which I tried to wrap up as follows:

I wasn't really involved in it, but I think the reasoning for the Mc's is that in the Line Break Algorithm, there is a rule early on that CMs behave like their base. So if you have SA CM, it behaves like SA. As long as Mc and Mn follow their base (well-formed text), it is thus not a problem for linebreak (or wordbreak). (I put in an invariant to test for SA at the end of the document, so you can see the results.)

A. I think the real problematic characters are thus:

17D7          # Lm       KHMER SIGN LEK TOO
17DC          # Lo       KHMER SIGN AVAKRAHASANYA

These should be SA, since they are Alphabetic. They are currently:

17D7;NS # KHMER SIGN LEK TOO
17DC;AL # KHMER SIGN AVAKRAHASANYA

B. However, it probably does little harm to have SA be a *broader* set (within a script) than just the Alphabetic characters. The purpose of SA is to switch to a specialized algorithm, and adding in the M* lets the specialized algorithm get more context to work with -- it gets control in more places.
So it would be possible (but I'm not sure whether or not I would recommend it yet -- it would take some discussion) to change the following to also be SA:

0E31          # Mn       THAI CHARACTER MAI HAN-AKAT
0E34..0E3A    # Mn   [7] THAI CHARACTER SARA I..THAI CHARACTER PHINTHU
0E47..0E4E    # Mn   [8] THAI CHARACTER MAITAIKHU..THAI CHARACTER YAMAKKAN
0EB1          # Mn       LAO VOWEL SIGN MAI KAN
0EB4..0EB9    # Mn   [6] LAO VOWEL SIGN I..LAO VOWEL SIGN UU
0EBB..0EBC    # Mn   [2] LAO VOWEL SIGN MAI KON..LAO SEMIVOWEL SIGN LO
0EC8..0ECD    # Mn   [6] LAO TONE MAI EK..LAO NIGGAHITA
102C          # Mc       MYANMAR VOWEL SIGN AA
102D..1030    # Mn   [4] MYANMAR VOWEL SIGN I..MYANMAR VOWEL SIGN UU
1031          # Mc       MYANMAR VOWEL SIGN E
1032          # Mn       MYANMAR VOWEL SIGN AI
1036..1037    # Mn   [2] MYANMAR SIGN ANUSVARA..MYANMAR SIGN DOT BELOW
1038          # Mc       MYANMAR SIGN VISARGA
1039          # Mn       MYANMAR SIGN VIRAMA
1056..1057    # Mc   [2] MYANMAR VOWEL SIGN VOCALIC R..MYANMAR VOWEL SIGN VOCALIC RR
1058..1059    # Mn   [2] MYANMAR VOWEL SIGN VOCALIC L..MYANMAR VOWEL SIGN VOCALIC LL
17B6          # Mc       KHMER VOWEL SIGN AA
17B7..17BD    # Mn   [7] KHMER VOWEL SIGN I..KHMER VOWEL SIGN UA
17BE..17C5    # Mc   [8] KHMER VOWEL SIGN OE..KHMER VOWEL SIGN AU
17C6          # Mn       KHMER SIGN NIKAHIT
17C7..17C8    # Mc   [2] KHMER SIGN REAHMUK..KHMER SIGN YUUKALEAPINTU
17C9..17D3    # Mn  [11] KHMER SIGN MUUSIKATOAN..KHMER SIGN BATHAMASAT
17DD          # Mn       KHMER SIGN ATTHACAN

C. There is a case to be made for the oddball Cf characters to really be Mn (but I'm not the one to make it ;-).

17B4..17B5 ; SA # Cf   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA

D. According to information from Martin Hosken, the LineBreak property values for Tai Le and New Tai Lue should be assigned similarly.
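The rule cited at the top of this document, that a CM behaves like its base, can be illustrated with a small sketch. This is a simplified reading of that rule and not the Line Break Algorithm itself: the function name `resolve_marks`, the `NO_ATTACH` set, and the fallback to AL for a mark with no usable base are assumptions made for illustration.

```python
# Sketch: give each combining mark (CM) the line-break class of its base.
# Assumption: classes listed in NO_ATTACH cannot act as a base, and a mark
# without a usable base is treated as AL. Names here are hypothetical.
NO_ATTACH = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def resolve_marks(classes):
    """Return a copy of `classes` with every CM replaced by its base's class."""
    resolved = []
    for cls in classes:
        if cls != "CM":
            resolved.append(cls)
        elif resolved and resolved[-1] not in NO_ATTACH:
            # Well-formed text: the mark follows a base (or an already
            # resolved mark), so it inherits that class.
            resolved.append(resolved[-1])
        else:
            # Defective sequence: no base to inherit from.
            resolved.append("AL")
    return resolved


# A Thai consonant (SA) followed by a vowel sign and a tone mark (both CM):
print(resolve_marks(["SA", "CM", "CM", "SA"]))
```

With this resolution step in place, a well-formed run such as SA CM CM is seen by the later rules as SA SA SA, which is why the class of the marks themselves matters less than the class of the letters they follow.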
Mark

==============

# Test SA characters

# They are limited to certain scripts:

Let $SAScripts = [$script:thai $script:lao $script:myanmar $script:khmer]
$SAScripts » $LineBreak:SA

# Check that in those scripts, SA = alphabetic - marks (plus a few oddball Cfs)

[$SAScripts & [$Alphabetic $gc:cf - $gcAllMarks]] = $LineBreak:SA

FALSE

**** START Error Info ****

In $LineBreak:SA, but not in [[$script:thai $script:lao $script:myanmar $script:khmer] & [$Alphabetic $gc:cf - [$gc:Nonspacing_Mark $gc:Enclosing_Mark $gc:Spacing_Mark]]] :

# Total code points: 0

Not in $LineBreak:SA, but in [[$script:thai $script:lao $script:myanmar $script:khmer] & [$Alphabetic $gc:cf - [$gc:Nonspacing_Mark $gc:Enclosing_Mark $gc:Spacing_Mark]]] :

17D7          # Lm       KHMER SIGN LEK TOO
17DC          # Lo       KHMER SIGN AVAKRAHASANYA

# Total code points: 2

In both $LineBreak:SA, and in [[$script:thai $script:lao $script:myanmar $script:khmer] & [$Alphabetic $gc:cf - [$gc:Nonspacing_Mark $gc:Enclosing_Mark $gc:Spacing_Mark]]] :

0E01..0E30    # Lo  [48] THAI CHARACTER KO KAI..THAI CHARACTER SARA A
0E32..0E33    # Lo   [2] THAI CHARACTER SARA AA..THAI CHARACTER SARA AM
0E40..0E45    # Lo   [6] THAI CHARACTER SARA E..THAI CHARACTER LAKKHANGYAO
0E46          # Lm       THAI CHARACTER MAIYAMOK
0E81..0E82    # Lo   [2] LAO LETTER KO..LAO LETTER KHO SUNG
0E84          # Lo       LAO LETTER KHO TAM
0E87..0E88    # Lo   [2] LAO LETTER NGO..LAO LETTER CO
0E8A          # Lo       LAO LETTER SO TAM
0E8D          # Lo       LAO LETTER NYO
0E94..0E97    # Lo   [4] LAO LETTER DO..LAO LETTER THO TAM
0E99..0E9F    # Lo   [7] LAO LETTER NO..LAO LETTER FO SUNG
0EA1..0EA3    # Lo   [3] LAO LETTER MO..LAO LETTER LO LING
0EA5          # Lo       LAO LETTER LO LOOT
0EA7          # Lo       LAO LETTER WO
0EAA..0EAB    # Lo   [2] LAO LETTER SO SUNG..LAO LETTER HO SUNG
0EAD..0EB0    # Lo   [4] LAO LETTER O..LAO VOWEL SIGN A
0EB2..0EB3    # Lo   [2] LAO VOWEL SIGN AA..LAO VOWEL SIGN AM
0EBD          # Lo       LAO SEMIVOWEL SIGN NYO
0EC0..0EC4    # Lo   [5] LAO VOWEL SIGN E..LAO VOWEL SIGN AI
0EC6          # Lm       LAO KO LA
0EDC..0EDD    # Lo   [2] LAO HO NO..LAO HO MO
1000..1021    # Lo  [34] MYANMAR LETTER KA..MYANMAR LETTER A
1023..1027    # Lo   [5] MYANMAR LETTER I..MYANMAR LETTER E
1029..102A    # Lo   [2] MYANMAR LETTER O..MYANMAR LETTER AU
1050..1055    # Lo   [6] MYANMAR LETTER SHA..MYANMAR LETTER VOCALIC LL
1780..17B3    # Lo  [52] KHMER LETTER KA..KHMER INDEPENDENT VOWEL QAU
17B4..17B5    # Cf   [2] KHMER VOWEL INHERENT AQ..KHMER VOWEL INHERENT AA

# Total code points: 198

**** END Error Info ****
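
The set comparison behind the invariant above can be sketched in Python over a handful of code points. This is only an illustration, not the tool that produced the report: the `Props` record, the `SAMPLE` table, and the function names are hypothetical, the table is a tiny hand-picked subset, and the values for 0E31 (Alphabetic, CM) and 0041 are assumptions rather than values stated in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Props:
    script: str       # Script property
    gc: str           # General_Category
    alphabetic: bool  # Alphabetic property
    lb: str           # Line_Break property


# Hand-picked sample; a real check would load the full property data.
SAMPLE = {
    0x0E01: Props("Thai", "Lo", True, "SA"),
    0x0E31: Props("Thai", "Mn", True, "CM"),   # assumed values; a mark
    0x0E46: Props("Thai", "Lm", True, "SA"),
    0x17B4: Props("Khmer", "Cf", False, "SA"),  # oddball Cf
    0x17D7: Props("Khmer", "Lm", True, "NS"),
    0x17DC: Props("Khmer", "Lo", True, "AL"),
    0x0041: Props("Latin", "Lu", True, "AL"),   # assumed; outside SA scripts
}

SA_SCRIPTS = {"Thai", "Lao", "Myanmar", "Khmer"}
MARKS = {"Mn", "Me", "Mc"}


def expected_sa(table):
    """[$SAScripts & [$Alphabetic $gc:cf - $gcAllMarks]]"""
    return {
        cp for cp, p in table.items()
        if p.script in SA_SCRIPTS
        and (p.alphabetic or p.gc == "Cf")
        and p.gc not in MARKS
    }


def actual_sa(table):
    """$LineBreak:SA"""
    return {cp for cp, p in table.items() if p.lb == "SA"}


def compare(table):
    """Split the two sets the same three ways as the error report."""
    exp, act = expected_sa(table), actual_sa(table)
    return {
        "only_actual": sorted(act - exp),
        "only_expected": sorted(exp - act),
        "both": sorted(act & exp),
    }


report = compare(SAMPLE)
for label, cps in report.items():
    print(label, [f"{cp:04X}" for cp in cps])
```

On this sample the only mismatches are 17D7 and 17DC, matching the two code points flagged in the report; the mark 0E31 drops out of both sets regardless of its Alphabetic value because marks are subtracted before the comparison.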