Public Review Issues

Accumulated Feedback on PRI #335

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Oct 24 04:54:54 CDT 2016
Name: SWARAN LATA
Report Type: Other Question, Problem, or Feedback
Opt Subject: Indic Line breaking rules for #335 - Proposed Update UAX #14, Unicode Line Breaking Algorithm

In the writing systems of Indian languages, it is preferred that lines break
at word boundaries. Where a break is required, the following principle should
be adhered to: a new line cannot begin with any of the following symbols or
punctuation marks, and these marks should be retained with the associated text:

Symbol  Character name               Unicode code point
।       DEVANAGARI DANDA             U+0964
॥       DEVANAGARI DOUBLE DANDA      U+0965
)       RIGHT PARENTHESIS            U+0029
+       PLUS SIGN                    U+002B
*       ASTERISK                     U+002A
-       HYPHENATION POINT
-       VISIBLE HYPHEN HYPHENATION   U+2027
-       SOFT HYPHEN                  U+00AD
/       SOLIDUS                      U+002F
,       COMMA                        U+002C
.       FULL STOP                    U+002E
:       COLON                        U+003A
;       SEMICOLON                    U+003B
=       EQUALS SIGN                  U+003D
>       GREATER-THAN SIGN            U+003E
]       RIGHT SQUARE BRACKET         U+005D
_       LOW LINE                     U+005F
|       VERTICAL LINE                U+007C
}       RIGHT CURLY BRACKET          U+007D
~       TILDE                        U+007E
%       PERCENT SIGN                 U+0025
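
The principle stated above the table can be sketched in Python as a simple lookup. This is only an illustration, not part of the submission: it uses just the code points that the table gives explicitly, treats a space as the only word boundary, and the names `NO_LINE_START`, `may_start_line`, and `allowed_breaks` are hypothetical.

```python
# Code points listed in the table that must not begin a new line.
# Only rows with an explicit code point are included (an assumption of
# this sketch; the row without a code point is left out).
NO_LINE_START = frozenset([
    0x0964, 0x0965, 0x0029, 0x002B, 0x002A, 0x2027, 0x00AD, 0x002F,
    0x002C, 0x002E, 0x003A, 0x003B, 0x003D, 0x003E, 0x005D, 0x005F,
    0x007C, 0x007D, 0x007E, 0x0025,
])


def may_start_line(ch: str) -> bool:
    """Return True if the character is allowed to begin a new line."""
    return ord(ch) not in NO_LINE_START


def allowed_breaks(text: str) -> list:
    """Return indices i such that a new line may begin at text[i].

    Simplification: a break is only considered directly after a space,
    and it is rejected if the next character is a listed symbol.
    """
    return [
        i for i in range(1, len(text))
        if text[i - 1] == " " and text[i] != " " and may_start_line(text[i])
    ]


# A word, a spaced DEVANAGARI DANDA, then another word.
sample = "\u0930\u093e\u092e \u0964 \u0936\u094d\u092f\u093e\u092e"
print(allowed_breaks(sample))
```

In this sketch the danda stays on the same line as the preceding word, because the position before it is never reported as a place where a new line may begin.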
		
Hyphenation at line boundary in Indian languages 

•	A hyphen should be used at the break point so that the word can be read intuitively.

•	However, language-specific morpho-phonemic rules and industry practices
	(from media, publishing and grammar books) could be used for hyphenation.
	U+00AD (soft hyphen) is used in some languages such as Tamil and Malayalam.

•	Hyphenated words can be broken at the hyphenation point (U+2027), e.g.:
	    नर-नारी should be treated as:
	    नर- on the first line and नारी on the next line
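
The last bullet can be illustrated with a small Python sketch. It is not part of the submission and rests on assumptions: it treats U+2027 and U+002D HYPHEN-MINUS as visible hyphens (the second is an assumption based on the character that appears in the example word), and the names `HYPHENS` and `hyphen_break_points` are hypothetical.

```python
# Characters treated as visible hyphens in this sketch.
# U+2027 comes from the text above; U+002D is an assumption.
HYPHENS = {"\u2027", "-"}


def hyphen_break_points(word: str) -> list:
    """Return (first_line, next_line) pairs for each break after a hyphen.

    The hyphen stays at the end of the first line, so the next line
    never begins with it. Hyphens at the very start or end of the word
    are ignored.
    """
    return [
        (word[: i + 1], word[i + 1:])
        for i, ch in enumerate(word)
        if ch in HYPHENS and 0 < i < len(word) - 1
    ]


for first, rest in hyphen_break_points("नर-नारी"):
    print(first, "|", rest)
```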

Feedback above this line was reviewed during UTC #149.

Date/Time: Thu Mar 9 10:27:25 CST 2017
Name: Elmar Braun
Report Type: Public Review Issue
Opt Subject: Public Review Issue #335 / UAX #14 U+FF70

UAX #14 revisions 37 and proposed 38 list U+FF70 twice, in the table of 
characters in class CJ, and also in the table of characters in class NS. 
According to the Unicode 9.0 data, U+FF70 is in line breaking class CJ.
Therefore I believe its listing under class NS is an error.

Date/Time: Sun Apr 2 18:39:14 CDT 2017
Name: Rainer Perske
Report Type: Error Report
Opt Subject: Errors in LineBreakTest.txt

Dear Sir or Madam

I have found three wrong entries in http://www.unicode.org/Public/9.0.0/ucd/auxiliary/LineBreakTest.txt

The wrong entries are (comments omitted):
× 200D × 0308 ÷ 231A ÷
× 200D × 0308 ÷ 261D ÷
× 200D × 0308 ÷ 1F3FB ÷

These entries violate the non-tailorable line breaking rules as indicated below.

These lines represent characters with line break property:
ZWJ CM ID
ZWJ CM EB
ZWJ CM EM

LB9 says: Do not break a combining character sequence; treat it as if it has the 
line breaking class of the base character in all of the following rules. Treat 
ZWJ as if it were CM.

The rule is: Treat X (CM | ZWJ)* as if it were X.

Hence these lines are to be treated as:
ZWJ ID
ZWJ EB
ZWJ EM

LB8a says: Do not break between a zero width joiner and an ideograph, emoji base 
or emoji modifier.

The rule is: ZWJ × (ID | EB | EM)

Hence the lines should read:
× 200D × 0308 × 231A ÷
× 200D × 0308 × 261D ÷
× 200D × 0308 × 1F3FB ÷

Kind regards

Rainer Perske
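
The reading of LB9 and LB8a given in this report can be traced with a minimal Python sketch. It is an illustration only, not a line breaker: it models just these two rules, follows the report's interpretation that a leading ZWJ can act as the base X in LB9, and takes the line break classes of the five code points from the report itself. The function names are hypothetical.

```python
# Line break classes of the code points in the report, as given there
# (Unicode 9.0 values).
CLASSES = {0x200D: "ZWJ", 0x0308: "CM", 0x231A: "ID",
           0x261D: "EB", 0x1F3FB: "EM"}
# Classes that LB9 excludes as a base X.
LB9_EXCLUDED = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def parse_test_line(line):
    """Parse a test line such as '× 200D × 0308 ÷ 231A ÷'.

    Returns (code points, markers); any trailing '#' comment is dropped.
    """
    tokens = line.split("#")[0].split()
    markers = tokens[0::2]                     # markers before, between, after
    cps = [int(t, 16) for t in tokens[1::2]]   # hexadecimal code points
    return cps, markers


def interior_markers(cps):
    """Markers between code points when only LB9 and LB8a are applied.

    Returns '×' where one of the two rules forbids a break and None
    where neither rule decides.
    """
    result = []
    effective = []  # class of each code point after applying LB9
    for i, cp in enumerate(cps):
        cls = CLASSES[cp]
        attached = (cls in ("CM", "ZWJ") and i > 0
                    and effective[-1] not in LB9_EXCLUDED)
        if attached:
            effective.append(effective[-1])  # LB9: X (CM | ZWJ)* acts as X
            result.append("×")
        else:
            effective.append(cls)
            if i > 0:
                # LB8a: ZWJ × (ID | EB | EM)
                lb8a = effective[-2] == "ZWJ" and cls in ("ID", "EB", "EM")
                result.append("×" if lb8a else None)
    return result


cps, markers = parse_test_line("× 200D × 0308 ÷ 231A ÷")
print(markers[1:-1])          # interior markers in the published test line
print(interior_markers(cps))  # interior markers under the report's reading
```

Comparing the two printed lists shows where the published test line and the reading described in the report disagree.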

Date/Time: Sat Apr 29 22:48:11 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #335: More double diacritics

Just as U+035C..0362 have line break class GL, so should the other double diacritics 
U+1DCD and U+1DFC. Similarly, it may make sense for the left and conjoining half marks 
U+FE20, U+FE22, U+FE24, U+FE26..FE27, U+FE29, U+FE2B, and U+FE2D..FE2E to have line 
break class GL.

Date/Time: Sat Apr 29 23:16:48 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #335: Breaking between Hebrew and hyphens

The annex says that U+00AD SOFT HYPHEN “is an invisible format character with 
no width. It marks the place where an optional line break may occur inside a word. 
It can be used with all scripts.” However, because of LB21a, it does not work with 
the Hebrew script.

U+05BE HEBREW PUNCTUATION MAQAF, the Hebrew hyphen, has line break class BA. However, 
LB21a prevents a line break between HL and BA, making the maqaf essentially a 
non-breaking hyphen. If this is intentional, it should have line break class GL, to 
make the intent clear; if not, LB21a needs refinement, or even deletion.

It would be nice if the annex explained the rationale for LB21a. See also L2/13-083.
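
The interaction described in this report can be illustrated with a minimal Python sketch. It assumes a heavily simplified model in which only LB21a (HL (HY | BA) ×) and the usual break opportunity after BA and HY are considered; every other rule of UAX #14 is ignored. The function name and the `apply_lb21a` switch are hypothetical and not part of the annex.

```python
# Simplified model: line break classes are passed as strings such as
# "HL", "AL", "BA", "HY". Only LB21a and the break opportunity after
# BA/HY are modelled (an assumption of this sketch).

def break_allowed_after_hyphen(before: str, hyphen: str,
                               apply_lb21a: bool = True) -> bool:
    """Return True if a break is allowed directly after `hyphen`.

    `before` is the class of the character preceding the hyphen and
    `hyphen` is the class of the hyphen-like character itself.
    """
    if hyphen not in ("BA", "HY"):
        return False  # not a hyphen-like class in this sketch
    if apply_lb21a and before == "HL":
        return False  # LB21a: HL (HY | BA) ×
    return True


# U+00AD SOFT HYPHEN and U+05BE HEBREW PUNCTUATION MAQAF both have class BA.
print(break_allowed_after_hyphen("AL", "BA"))         # non-Hebrew letter + soft hyphen
print(break_allowed_after_hyphen("HL", "BA"))         # Hebrew letter + soft hyphen or maqaf
print(break_allowed_after_hyphen("HL", "BA", False))  # same, with LB21a switched off
```

The second call models the case raised in the report: after a Hebrew letter, neither the soft hyphen nor the maqaf yields a break opportunity while LB21a is in effect.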

Date/Time: Thu May 4 18:29:06 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #335: Syllable-based line breaks

The Unicode Standard explains how to break lines in Batak, Cham, Javanese, and
Vai based on orthographic syllables. I suggest formally encoding these four
scripts’ rules in UAX #14.

Here are some specific suggestions. Vai letters should all be ID, except
U+A60B..A60C, which should be CM. Cham letters U+AA00..AA24 and U+AA26..AA28
should be ID. Cham final consonants U+AA40..AA42 and U+AA44..AA4B should be
CM. U+AA25 is a tricky one. Batak and Javanese letters should be ID. Batak
killers should have a new class such that [ID × ID Batak_Killer] and [ID ×
Batak_Killer]. U+A9C0 JAVANESE PANGKON should be GL, or a new class that is like
GL but only glues lb=ID characters. I don’t know how to break around U+A9CF
JAVANESE PANGRANGKEP, but it might need a new class too.

There are some ambiguities in the rules (e.g. U+AA25 can be syllable-final or
-initial) but for the most part, they work. They are better than the current
situation, where most implementers don’t know about these scripts, so they
follow the default rules and miss many line break opportunities.

If the UTC doesn’t like this idea, I suggest copying the rules from the core
spec to a non-normative section of UAX #14. That would at least give these
rules more visibility. Someone looking to implement a Unicode line-breaker is
more likely to read the line-breaking spec than the entire core spec.
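
The class assignments suggested in this report for Vai and Cham could be expressed as a tailoring table along the following lines. This Python sketch is only an illustration: the function name `proposed_class` is hypothetical, "Vai letters" is approximated as characters in the Vai block (U+A500..A63F) whose general category starts with "L", and code points the report leaves open (such as U+AA25) return None. Batak and Javanese are omitted because the report proposes new classes for them without defining them fully.

```python
import unicodedata


def proposed_class(cp: int):
    """Return the line break class suggested in the report, or None.

    None means the report makes no definite suggestion for the code point.
    """
    # Vai: U+A60B..A60C are suggested as CM, all other Vai letters as ID.
    if 0xA60B <= cp <= 0xA60C:
        return "CM"
    # Assumption: "letter" means general category L* inside the Vai block.
    if 0xA500 <= cp <= 0xA63F and unicodedata.category(chr(cp)).startswith("L"):
        return "ID"
    # Cham letters U+AA00..AA24 and U+AA26..AA28 are suggested as ID.
    if 0xAA00 <= cp <= 0xAA24 or 0xAA26 <= cp <= 0xAA28:
        return "ID"
    # Cham final consonants U+AA40..AA42 and U+AA44..AA4B are suggested as CM.
    if 0xAA40 <= cp <= 0xAA42 or 0xAA44 <= cp <= 0xAA4B:
        return "CM"
    return None


for cp in (0xA500, 0xA60B, 0xAA00, 0xAA25, 0xAA40):
    print(f"U+{cp:04X}", proposed_class(cp))
```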