Accumulated Feedback on PRI #193

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Fri Aug 26 18:42:20 CDT 2011
Contact: umavs@ca.ibm.com
Name: V.S. Umamaheswaran
Report Type: Public Review Issue
Opt Subject: PRI 193 ... UAX 29


Forwarding a comment from our Thai expert: Nattapong Sirilappanich (email id: natta@th.ibm.com) ...

My inputs are: 1. Keep all grapheme properties for Thai characters. This is my
response to the question in section 3: [Editorial Note: The text has been modified to
not favor extended grapheme clusters, given that legacy grapheme clusters are
preferred for Thai, Lao, and Tai Viet characters. An alternative approach
would be to remove the characters (U+0E30, U+0E32, U+0E33, U+0E40-U+0E45,
U+0EB0, U+0EB2, U+0EB3, U+0EC0-U+0EC4, U+AAB5, U+AAB6, U+AAB9, U+AABB, U+AABC)
from Extend and Spacing_Mark. Feedback on this issue would be appreciated.]

2. To prevent confusion, this should be made explicit: the Thai language uses
legacy grapheme clusters for cursor movement and editing behavior, and
extended grapheme clusters for rendering. So this line in section 3 should be
updated: However, for Southeast Asian scripts such as Thai and Lao, the legacy
grapheme clusters are generally preferred
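
The distinction drawn in point 2 can be illustrated with a minimal Python sketch. It is not the UAX #29 algorithm: the function names are hypothetical, Extend is approximated by the general categories Mn and Me, and SPACING_MARK_LIKE is an assumed Thai subset of the characters listed in the editorial note, treated here as joining to the preceding character only when extended clusters are requested. The leading vowels U+0E40-U+0E44 are deliberately left out of the sketch.

```python
import unicodedata

# Assumption for illustration only: a Thai subset of the characters named in
# the editorial note, treated as SpacingMark-like. Real implementations should
# read the Grapheme_Cluster_Break property from the UCD data files instead.
SPACING_MARK_LIKE = {0x0E30, 0x0E32, 0x0E33, 0x0E45}


def is_extend(ch):
    """Rough stand-in for Extend: nonspacing and enclosing marks."""
    return unicodedata.category(ch) in ("Mn", "Me")


def clusters(text, extended):
    """Split text into simplified legacy or extended grapheme clusters."""
    out = []
    for ch in text:
        joins = is_extend(ch) or (extended and ord(ch) in SPACING_MARK_LIKE)
        if out and joins:
            out[-1] += ch  # attach to the preceding cluster
        else:
            out.append(ch)  # start a new cluster
    return out


# NO NU + MAI THO + SARA AM
word = "\u0E19\u0E49\u0E33"
legacy = clusters(word, extended=False)   # two cursor positions
extended = clusters(word, extended=True)  # one unit
```

Under these assumptions, the legacy segmentation leaves SARA AM as its own cluster (so a cursor can stop before it), while the extended segmentation keeps the whole syllable together.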

Date/Time: Mon Oct 24 08:22:15 CDT 2011
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI #193: Proposed Update UAX #29: Unicode Text Segmentation


A minor editorial comment:

8. Hangul Syllable Boundary Determination
(http://www.unicode.org/reports/tr29/tr29-18.html#Hangul_Syllable_Boundary_Determination):
Under subtitle “Transforming into Standard Korean Syllables”, in the line
[^L] V → [^L] Lf V
the f is neither subscripted nor italicized as it should be.

Feedback Received After Closing Date

Date/Time: Thu Nov 10 02:33:11 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX#29: word breaks with hiragana and voiced marks


I'd like to renew feedback I submitted in 2007, in a UAX#29 PRI, about
word breaks with hiragana and voiced marks. Nobody seems to have
replied to that feedback, and some characters that are used in both
hiragana and katakana are visibly not treated as consistently as they
should be (for example, with differences between normal and halfwidth
variants).

See http://unicode.org/mail-arch/unicode-ml/y2007-m08/0091.html

Quoting the message:
This UAX treats KATAKANA specially, to avoid breaking between two 
Katakana letters, but still breaks between Hiragana letters. However, 
this is probably not right in every case, notably for a sequence of 
a Hiragana letter followed by a voiced or semi-voiced mark: 
U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK 
U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK 
and possibly other characters currently listed in the Katakana value in table 3: 
U+3031 (〱) VERTICAL KANA REPEAT MARK 
U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK 
U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF 
U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF 
U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF 
U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN 
U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK 
U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK 
U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK 
U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK 
Do word breaks really occur between Hiragana letters and these 
marks encoded after them (note that Hiragana letters are excluded 
from "ALetter" in table 3)? If not, then 
(1) the list of characters above would be better listed under a 
    separate value (say "ExtendKana"), and removed from Katakana in table 3. 
(2) a new value "Hiragana" should be created for Hiragana letters in table 
    3, like this: 
        Katakana	script="KATAKANA" (rewritten first row in table 3) 
        Hiragana	script="HIRAGANA" (new inserted row in table 3) 
        ExtendKana	(the list of characters above) (new row in table 3) 
(3) the existing rule WB13 (Katakana × Katakana) should be rewritten 
    equivalently as: 
        WB13. (Katakana | ExtendKana) × (Katakana | ExtendKana) 
(4) the following subrules WB13a and WB13b should be rewritten equivalently as: 
        WB13a. (ALetter | Numeric | Katakana | ExtendKana | ExtendNumLet) 
                × ExtendNumLet 
        WB13b. ExtendNumLet × (ALetter | Numeric | Katakana | ExtendKana) 
(5) Another subrule should be added: 
        WB13c. (Hiragana | ExtendKana) × ExtendKana 
No other change is needed, because word break will still occur either 
between two Hiragana letters, or after an ExtendKana and before a 
Hiragana letter, in the next rule: 
        WB14. Any ÷ Any 
Or am I missing something? 
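
The proposal above can be sketched in Python for the kana-only case. This is an illustration under stated assumptions, not an implementation of UAX #29: the names kana_class, no_break, and segment are hypothetical, the code point ranges are a rough stand-in for script="HIRAGANA" and script="KATAKANA", and only the proposed WB13 and WB13c plus the fallback WB14 are modeled (ALetter, Numeric, ExtendNumLet, and the WB4 handling of Extend and Format are ignored).

```python
# Characters the message proposes to move into a new "ExtendKana" value.
EXTEND_KANA = {
    0x309B, 0x309C,
    0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
    0x30A0, 0x30FC,
    0xFF70, 0xFF9E, 0xFF9F,
}


def kana_class(ch):
    """Classify one character; the ranges are approximations for this sketch."""
    cp = ord(ch)
    if cp in EXTEND_KANA:
        return "ExtendKana"
    if 0x3041 <= cp <= 0x3096:
        return "Hiragana"
    if 0x30A1 <= cp <= 0x30FA or 0xFF66 <= cp <= 0xFF9D:
        return "Katakana"
    return "Other"


def no_break(left, right):
    """Return True if the proposed rules forbid a break between the classes."""
    kata = ("Katakana", "ExtendKana")
    # Proposed WB13: (Katakana | ExtendKana) x (Katakana | ExtendKana)
    if left in kata and right in kata:
        return True
    # Proposed WB13c: (Hiragana | ExtendKana) x ExtendKana
    if left in ("Hiragana", "ExtendKana") and right == "ExtendKana":
        return True
    # WB14: Any / Any (break everywhere else)
    return False


def segment(text):
    """Split text wherever the simplified rules allow a break."""
    pieces = []
    for ch in text:
        if pieces and no_break(kana_class(pieces[-1][-1]), kana_class(ch)):
            pieces[-1] += ch
        else:
            pieces.append(ch)
    return pieces
```

With these rules, a Hiragana letter followed by U+309B stays attached to the mark, a break still occurs before the next Hiragana letter, and a run of Katakana letters containing U+30FC remains a single segment.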

Date/Time: Thu Nov 10 03:26:08 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX#29: property values used in [Charts29]


UAX#29 makes an informative reference to a sample 
table-based implementation shown in [Charts29]:

http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.html

However this HTML page still contains row and column headers
containing old non-standard property values that do not match the
Word-Break property values enumerated and described in UAX#29 and
assigned to characters in the normative datafile [Data29].

Why are those "_FE" suffixes added to a couple of property values in
the chart table and in all tooltips that appear when hovering over characters
in the sample strings?

It seems that this [Charts29] page has not been updated for a long time,
and this is also visible in the numeric mapping of rule names (which
is also used in the test data file), which:

- forgets to assign the number 3.2 to the rule named WB3b (insert
a word break before an explicit line break, "÷ (Newline | CR | LF)") ;

- still incorrectly defines the rule named WB4 and numbered 4.0 
as the outdated contextual rule
"[^ Newline CR LF ] × [Format  Extend]", instead of the current 
rewriting rule "X (Extend | Format)* → X".

- gives the wrong definition for the last contextual rule, named 
WB14 and numbered 999.0, displaying "÷ Any", instead of "Any ÷ Any" ;