L2/19-039

 

Grapheme break property for U+FF9E and U+FF9F

Eric Muller, Amazon

January 10, 2019

 

 

The characters U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK and U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK have GCB=EX (Grapheme_Cluster_Break=Extend), just like their compatibility decompositions, U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK and U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK, respectively. However, unlike their compatibility decompositions, they are not combining marks: they have gc=Lm, and are in fact the only two characters with both GCB=EX and gc=Lm.
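The general category and decomposition data cited above can be checked with a short script. This is only a sketch: it uses Python's standard unicodedata module, which does not expose the Grapheme_Cluster_Break property, so the GCB values themselves are not verified here. The helper name describe is made up for this illustration.

```python
import unicodedata

def describe(cp):
    """Return (General_Category, decomposition mapping) for a code point.

    'describe' is a hypothetical helper for this note, not an existing API.
    """
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.decomposition(ch)

for cp in (0xFF9E, 0xFF9F, 0x3099, 0x309A):
    gc, decomp = describe(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={gc}, decomposition={decomp!r}")
```

The halfwidth marks report gc=Lm with a <narrow> compatibility decomposition to U+3099 and U+309A, while the decomposition targets themselves report gc=Mn.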

While it is true that those characters usually follow other (halfwidth katakana) characters, and in some sense function as a unit with them, the same can be said of most of the other Lm characters, notably U+30FD KATAKANA ITERATION MARK and U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK.

It is not unreasonable to style those characters differently from the character they follow (e.g. in a different font). In general, mixed styling within a grapheme cluster leads to complications.

It is also reasonable to show those characters in isolation. In that case, with GCB=EX, they would probably form a grapheme cluster with whatever character precedes them; but unlike combining marks, there is no established mechanism to give them a “base”.

For those reasons, we recommend that the GCB property of those two characters be changed from EX (Extend) to XX (Other).
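The practical effect of the recommendation can be illustrated with a deliberately simplified segmenter. The sketch below is not the full grapheme cluster algorithm: it applies only the "do not break before Extend" rule, and the two property tables are hand-written, hypothetical excerpts covering just the four code points discussed here (everything else is treated as XX).

```python
# Hypothetical, hand-written excerpts of the GCB property (not real data files).
GCB_CURRENT = {0xFF9E: "EX", 0xFF9F: "EX", 0x3099: "EX", 0x309A: "EX"}
GCB_PROPOSED = {0x3099: "EX", 0x309A: "EX"}  # U+FF9E and U+FF9F fall back to XX

def clusters(text, gcb):
    """Split text using only the rule 'do not break before Extend'.

    All other segmentation rules (CR/LF, controls, Hangul, etc.) are ignored.
    """
    result = []
    for ch in text:
        if result and gcb.get(ord(ch), "XX") == "EX":
            result[-1] += ch      # attach the Extend character to the previous cluster
        else:
            result.append(ch)     # start a new cluster
    return result

halfwidth = "\uFF76\uFF9E"  # HALFWIDTH KATAKANA LETTER KA + HALFWIDTH VOICED SOUND MARK
fullwidth = "\u30AB\u3099"  # KATAKANA LETTER KA + COMBINING VOICED SOUND MARK

print(len(clusters(halfwidth, GCB_CURRENT)))   # one cluster under the current value
print(len(clusters(halfwidth, GCB_PROPOSED)))  # two clusters under the proposed value
print(len(clusters(fullwidth, GCB_PROPOSED)))  # combining mark still attaches
```

Under the proposed value, the halfwidth sound marks become their own grapheme clusters, like other Lm characters, while sequences using the true combining marks U+3099 and U+309A are unaffected.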