L2/03-106

Re: LB6 Issue
From: Mark Davis
Date: 2003-03-05

Here is my analysis of the LB6 Issue. The text in the proposed update (version 13) is the following.

LB 6  Don’t break grapheme clusters (before combining marks, around virama or on sequences of conjoining Jamos.

Treat X CM* as if it were X

Treat a sequence J L* JL JV * JV JT* as if it were a Hangul Syllable

This does include the fix from document L2/02-267, which is to never break default grapheme clusters, but there are a few problems.

1. Ordering

Line 2 should be before Line 1, i.e.

Treat a sequence J L* JL JV * JV JT* as if it were a Hangul Syllable

Treat X CM* as if it were X

The reason is that CMs get absorbed, but not in the middle of a Hangul Syllable.
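One possible reading of this ordering argument can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the one-letter class codes, the simplified Jamo pattern (which is not the rule text quoted above), and the helper names absorb_cm and merge_jamo.

```python
import re

# Hypothetical one-letter codes for line-break classes:
#   L = leading Jamo, V = vowel Jamo, T = trailing Jamo,
#   C = combining mark (CM), H = Hangul syllable unit.
# Simplified, assumed pattern for a contiguous run of conjoining Jamos.
JAMO_RUN = re.compile(r"L+V*T*|V+T*|T+")

def absorb_cm(classes):
    """Treat X CM* as if it were X: drop marks that follow a base."""
    return re.sub(r"([^C])C+", r"\1", classes)

def merge_jamo(classes):
    """Treat a contiguous Jamo run as a single Hangul syllable unit."""
    return JAMO_RUN.sub("H", classes)

# A combining mark sitting between a leading Jamo and a vowel Jamo.
text = "LCV"

# CM rule first: the mark vanishes, so the Jamos on either side fuse.
cm_first = merge_jamo(absorb_cm(text))

# Hangul rule first: the mark interrupts the run, leaving two units.
hangul_first = absorb_cm(merge_jamo(text))
```

In this model, applying the CM rule first yields a single unit "H" that spans the combining mark, while applying the Hangul rule first yields "HH", with the mark attached to the first unit only.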

2. Deviation

Linebreak uses the CM class, which includes all Mc, Mn, and Me characters, plus some others. For it actually not to break default grapheme clusters, CM needs to include all of the Grapheme_Extend characters. With all but 2 characters it does. The Grapheme_Extend characters include Mn, Me, and Other_Grapheme_Extend. The first two are already included in CM. Other_Grapheme_Extend consists mostly of Mc characters, which are also in CM. The only 2 outliers are:

FF9E ; Other_Grapheme_Extend # Lm HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F ; Other_Grapheme_Extend # Lm HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
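A check of this kind can be sketched in Python, assuming the general category (Mn, Mc, Me) is an adequate stand-in for CM membership; it ignores the "plus some others" in CM. The sample data and the function name outliers are illustrative, the 09BE line is an assumed example of an Mc entry, and Python's unicodedata reflects a later Unicode version than the one discussed here.

```python
import unicodedata

# Lines in the style of the property data quoted above. The first line is an
# illustrative assumption; the other two are the entries cited in the text.
SAMPLE = """\
09BE ; Other_Grapheme_Extend # Mc BENGALI VOWEL SIGN AA
FF9E ; Other_Grapheme_Extend # Lm HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F ; Other_Grapheme_Extend # Lm HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
"""

# General categories that the text says are all covered by CM.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

def outliers(lines):
    """Return code points whose general category is not a mark category."""
    found = []
    for line in lines.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        # Single code points only; ranges such as XXXX..YYYY are not handled.
        code, _prop = (field.strip() for field in data.split(";"))
        if unicodedata.category(chr(int(code, 16))) not in MARK_CATEGORIES:
            found.append(code)
    return found

result = outliers(SAMPLE)
```

For the sample above, result contains only FF9E and FF9F.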

The reason that these are in Other_Grapheme_Extend is that they have decompositions (but notice, compatibility decompositions) to combining marks.

3099;COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK;Mn;...
309A;COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK;Mn;...
FF9E;HALFWIDTH KATAKANA VOICED SOUND MARK;Lm;0;L;<narrow> 3099;...
FF9F;HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK;Lm;0;L;<narrow> 309A;...
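The data above can be inspected with Python's standard unicodedata module. This is a minimal sketch; the helper name describe and the choice of halfwidth KA (FF76) as a base character are assumptions made for illustration.

```python
import unicodedata

def describe(ch):
    """Return (general category, decomposition mapping) for one character."""
    return (unicodedata.category(ch), unicodedata.decomposition(ch))

voiced = describe("\uFF9E")     # halfwidth voiced sound mark
combining = describe("\u3099")  # combining voiced sound mark

# Halfwidth KA followed by the halfwidth voiced sound mark.
halfwidth_ga = "\uFF76\uFF9E"

# Compatibility normalization maps the halfwidth mark to the combining mark;
# canonical normalization leaves the sequence untouched.
nfkd = unicodedata.normalize("NFKD", halfwidth_ga)
nfkc = unicodedata.normalize("NFKC", halfwidth_ga)
nfc = unicodedata.normalize("NFC", halfwidth_ga)
```

Here voiced is ("Lm", "<narrow> 3099") and combining is ("Mn", ""). NFKD yields 30AB 3099, NFKC composes that to 30AC, and NFC returns the input unchanged, which shows that only the compatibility forms connect FF9E to a combining mark.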

Alternative 1. Moving FF9E & FF9F from NS to CM would resolve the difference. The effective change would be small. Generally, CM binds even more closely to the previous character than NS does; it can never be broken apart from it. On the other hand, there are some odd cases where NS binds more closely across spaces, as in the following.

LB 11  Don’t break within ‘]h’, even with intervening spaces.

CL SP* × NS

So the question is whether the degenerate case of FF9E or FF9F preceded by a space is an issue.
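The degenerate case can be made concrete with a toy model of this one rule in Python. The one-letter class codes and the function name lb11_prohibits_break are assumptions for illustration; no other line-breaking rule is modeled, including whatever applies to a CM that follows a space.

```python
import re

# Hypothetical one-letter stand-ins for line-break classes:
#   c = CL (closing punctuation), s = SP (space),
#   n = NS (nonstarter), m = CM (combining mark), a = anything else.
CL_THEN_SPACES = re.compile(r"cs*$")

def lb11_prohibits_break(classes, i):
    """True if CL SP* x NS forbids a break before position i."""
    return classes[i] == "n" and CL_THEN_SPACES.search(classes[:i]) is not None

# "] " followed by FF9E, with FF9E classified as NS and then as CM.
as_ns = lb11_prohibits_break("csn", 2)
as_cm = lb11_prohibits_break("csm", 2)
```

With FF9E classified as NS, the rule matches and as_ns is True; reclassified as CM, this rule no longer matches and as_cm is False, so the outcome would depend on the rules for CM instead.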

Alternative 2. Because FF9E & FF9F have only compatibility decompositions (not canonical ones), there is no formal requirement to have them in the default grapheme cluster, so one option is to remove them from Other_Grapheme_Extend. Keeping them in OGE does better reflect usage, and makes NFKC and NFKD text work a bit better, but it is not a formal requirement.

3. Typos: