Accumulated Feedback on PRI #240

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Tue Nov 20 08:37:04 CST 2012
Contact: ritt.ks@gmail.com
Name: Konstantin
Report Type: Error Report
Opt Subject: An issue with breaking sentences and words separated with dot-alike characters


As of Unicode 5.1, the MidNumLet Word_Break property value
(apostrophe-alike + dot-alike characters) caused sequences <(ALetter)+
MidNumLet (ALetter)+> to be treated like a single word. Whilst it
seems to be an improvement in handling words with apostrophes like
"can't" or "aujourd'hui", it also causes a regression in handling
words separated with dot-alike characters (e.g. domain names -- see
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/63311,
missed space(s) in the user's text -- "hi.there", or navigating
through the code -- "struct.member" (yeah, I know this is out of scope
of the default breaking algorithm, but still), and so on).

And the worst thing is that the default algorithm now specifies a
sentence break in the middle of a word. For example, "mr.Hamster"
forms two sentences due to rule SB8
(http://www.unicode.org/reports/tr29/#SB8) but still a single word due
to rules WB6-WB7 (http://www.unicode.org/reports/tr29/#WB6).
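
As an illustration of the word-break side of this report, here is a minimal Python sketch of rules WB5-WB7 only. It is not the UAX #29 reference algorithm: `is_aletter`, `MID_NUM_LET`, `word_break_allowed`, and `words` are hypothetical names, `str.isalpha()` stands in for ALetter, and the MidNumLet set is limited to the apostrophe plus the four dot-alike characters named in this report.

```python
# Partial MidNumLet set (assumption): apostrophe plus FULL STOP,
# ONE DOT LEADER, SMALL FULL STOP, and FULLWIDTH FULL STOP.
MID_NUM_LET = {"'", ".", "\u2024", "\ufe52", "\uff0e"}


def is_aletter(ch):
    # Assumption: str.isalpha() approximates Word_Break=ALetter.
    return ch.isalpha()


def word_break_allowed(text, i):
    """Return True if a word boundary is allowed between text[i-1] and text[i]."""
    prev, cur = text[i - 1], text[i]
    # WB5: ALetter x ALetter
    if is_aletter(prev) and is_aletter(cur):
        return False
    # WB6: ALetter x MidNumLet ALetter
    if (is_aletter(prev) and cur in MID_NUM_LET
            and i + 1 < len(text) and is_aletter(text[i + 1])):
        return False
    # WB7: ALetter MidNumLet x ALetter
    if i >= 2 and is_aletter(text[i - 2]) and prev in MID_NUM_LET and is_aletter(cur):
        return False
    # All other rules are omitted in this sketch: break everywhere else.
    return True


def words(text):
    """Split text at every position where the toy rules allow a boundary."""
    out, start = [], 0
    for i in range(1, len(text)):
        if word_break_allowed(text, i):
            out.append(text[start:i])
            start = i
    out.append(text[start:])
    return out


print(words("mr.Hamster"))
print(words("hi there"))
```

Under these assumptions, "mr.Hamster" and "hi.there" each come out as a single segment, which is the behaviour the report describes.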

A simple possible solution is to map some or all of those dot-alike
characters (FULL STOP, ONE DOT LEADER, SMALL FULL STOP, and FULLWIDTH
FULL STOP) back to the MidNum Word_Break property value (this is how
CLDR tailors the default algorithm for en_US_POSIX).

Another possible solution I see is to split ALetter into Upper, Lower,
and OLetter, and to map those dot-alike characters to some new Term
Word_Break property value (like the appropriate Sentence_Break
property values), and to extend the word breaking rules so that no
breaks are allowed within sequences like <Upper x Term x
Upper (Term)?> and <Lower x Term x Lower (Term)?> surrounded with
<(!(Upper | Lower | OLetter))*>. Then, rules WB6-WB7 could probably be
replaced with ones that specify a word break for sequence <Lower
(MidLetter | MidNumLet) Upper> and maybe <OLetter (MidLetter |
MidNumLet) Upper>.

Feedback above this line was considered at the February UTC meeting.

Added from mail archive per request from author:

From: Konstantin Ritt <ritt.ks_at_gmail.com>
Date: Sat, 2 Jun 2012 07:22:01 +0300

It seems like there is an inconsistency between what the default
grapheme clusters specification says and what the test results are
expected to be:

UAX#29 says:
> Another key feature (of default Unicode grapheme clusters) is that default Unicode grapheme clusters are atomic units with respect to the process of determining the Unicode default line, word, and sentence boundaries.
This is also mentioned in UAX#14:
> Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to Unicode Standard Annex #29, “Unicode Text Segmentation” [UAX29], as a first stage. Generally, the line breaking algorithm does not create line break opportunities within default grapheme clusters; therefore such a tailoring would be expected to produce results that are close to those defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB9 and LB10, or by some additional tailoring.

However, <U+0020 (SP), U+0308 (CM)> in the line breaking algorithm is
handled by the rules LB10+LB18 and produces a break opportunity, while
GB9 prohibits a break between <U+0020 (Other), U+0308 (Extend)>.
Section 9.2 "Legacy Support for Space Character as Base for Combining
Marks" in UAX#29 clarifies why a line break occurs there, but the
fact remains that the statements above are false and introduce some
ambiguity.
If the space character is no longer a grapheme base, the
grapheme cluster breaking rules need to be updated.

Kind regards,
Konstantin

Date/Time: Mon Mar 25 16:47:56 CDT 2013
Contact: wellnhofer@aevum.de
Name: Nick Wellnhofer
Report Type: Public Review Issue
Opt Subject: New word boundary rules in UAX #29, Unicode 6.3.0 (draft 2)


The new rule WB7c in UAX #29, Unicode 6.3.0 (draft 2) can be simplified to read:

    Hebrew_Letter Double_Quote × Hebrew_Letter

The single quote case is already handled in rule WB7.

Date/Time: Wed May 1 17:42:13 CDT 2013
Contact: andy.heninger@gmail.com
Name: Andy Heninger
Report Type: Public Review Issue
Opt Subject: UAX 29 proposed word break rules


In the UAX 29 proposed update (draft 2), there is a redundancy 
in the word break rules.

From the draft we have

WB7.   (ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote)   ×   (ALetter | Hebrew_Letter)

WB7c.  Hebrew_Letter (Single_Quote | Double_Quote)   ×   Hebrew_Letter

The "Single_Quote" term in WB7c is redundant - the same sequence of
Hebrew_Letter Single_Quote × Hebrew_Letter
is also covered by WB7.

So WB7c could be simplified to
    Hebrew_Letter Double_Quote  ×  Hebrew_Letter
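
The redundancy can be checked mechanically with a small Python sketch. It models each rule as the set of (left, middle, right) property-class triples for which the rule forbids a break before the right-hand class. This is a simplification that ignores every other rule; the variable names are hypothetical, and only the class names come from the draft rules quoted above.

```python
from itertools import product

LETTERS = {"ALetter", "Hebrew_Letter"}

# WB7 from the draft: (ALetter | Hebrew_Letter)
#   (MidLetter | MidNumLet | Single_Quote) x (ALetter | Hebrew_Letter)
wb7 = set(product(LETTERS, {"MidLetter", "MidNumLet", "Single_Quote"}, LETTERS))

# WB7c as drafted, and the simplified form suggested in the feedback.
wb7c_draft = set(product({"Hebrew_Letter"},
                         {"Single_Quote", "Double_Quote"},
                         {"Hebrew_Letter"}))
wb7c_simplified = set(product({"Hebrew_Letter"},
                              {"Double_Quote"},
                              {"Hebrew_Letter"}))

# The only triple dropped by the simplification is already matched by WB7.
dropped = wb7c_draft - wb7c_simplified
print(dropped)
print(dropped <= wb7)

# Taken together with WB7, both forms forbid exactly the same breaks.
same_coverage = (wb7 | wb7c_draft) == (wb7 | wb7c_simplified)
print(same_coverage)
```

Because both rules only forbid breaks, comparing the unions is enough for this check; rule ordering would matter only if one of them required a break.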

Date/Time: Fri May 3 05:12:20 CDT 2013
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: objection to changes based on L2/12-282



I object to the change made according to http://www.unicode.org/L2/L2012/12282-colon.html

As I noted several months ago, in http://unicode.org/cldr/trac/ticket/3987,
when the same issue was raised in CLDR:

First:

"... because Swedish uses it in the middle of a word"; well, it is used in a few
particular abbreviations, when the middle of the word is abbreviated away.
There are very few such abbreviations in general use, "c:a" (for "cirka"), "k:a"
(for "kyrka", church), "s:t" (for "sankt"), "g:a" (for "gamla", old). (B.t.w., Danish
and Norwegian use the abbreviation "ca." for "cirka".)

Colon is also used when adding inflections to abbreviated names, e.g. "tv:n"
(this seems to be used for Finnish as well), "USA:s" (this seems to be used,
at least sometimes, also in Norwegian and (Northern?) Sami), "UFO:t", "UFO:na",
or to numbers, e.g. "3:e", "3:ans". Colon is also used between digits, in e.g.
currency values (like "12:50") and time values (as it is for many languages),
and some other cases. 

So even if this use may be more prominent in Swedish, I would not limit it
to just Swedish; indeed, the restriction to "letter colon letter" is too narrow.

[Some other examples, Swedish: "Björn J:son Lindh" (abbreviation), "AIK:are"
(inflection), "Gustav III:s" (inflection). Finnish: Examples taken from
http://fi.wikipedia.org/wiki/Kaksoispiste:
"EU:n" (inflection), "v:sta" (abbreviation), "20:nnelle (inflection of number)",
"STTK:lainen" (inflection), "H:ki" (abbreviation), "t:mi" (abbreviation).]


And then:

I would suggest updating the following rules in UAX 29:

WB6. ALetter × (MidLetter | MidNumLet) ALetter
WB7. ALetter (MidLetter | MidNumLet) × ALetter

to

WB6. (Numeric | ALetter) × (MidLetter | MidNumLet) ALetter
WB7. (Numeric | ALetter) (MidLetter | MidNumLet) × ALetter

in order to handle number inflections (like 3:e (for tredje), 3:ans (for treans)).
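
To make the effect of the suggested change concrete, here is a rough Python sketch that runs a toy segmenter twice: once with only ALetter allowed on the left of WB6/WB7, and once with Numeric allowed as well. The functions `classify` and `segment` are hypothetical, the classifier is a crude stand-in for the real Word_Break property data, colon is treated as MidLetter as in the draft under discussion, and only rules WB5-WB10 are modelled.

```python
def classify(ch):
    # Crude approximation of Word_Break classes (assumption, not UCD data).
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    if ch == ":":
        return "MidLetter"
    if ch in ".'":
        return "MidNumLet"
    return "Other"


MID = {"MidLetter", "MidNumLet"}


def segment(text, left_classes):
    """Split text; left_classes is the set allowed on the left of WB6/WB7."""
    cls = [classify(c) for c in text]
    out, start = [], 0
    for i in range(1, len(text)):
        prev, cur = cls[i - 1], cls[i]
        nxt = cls[i + 1] if i + 1 < len(text) else None
        prev2 = cls[i - 2] if i >= 2 else None
        no_break = (
            (prev == cur and prev in {"ALetter", "Numeric"})  # WB5, WB8
            or (prev == "ALetter" and cur == "Numeric")       # WB9
            or (prev == "Numeric" and cur == "ALetter")       # WB10
            or (prev in left_classes and cur in MID and nxt == "ALetter")    # WB6
            or (prev2 in left_classes and prev in MID and cur == "ALetter")  # WB7
        )
        if not no_break:
            out.append(text[start:i])
            start = i
    out.append(text[start:])
    return out


current = {"ALetter"}
suggested = {"Numeric", "ALetter"}
print(segment("3:e", current))
print(segment("3:e", suggested))
print(segment("USA:s", current))
```

In this sketch "3:e" is split at the colon under the current left-hand side and kept whole under the suggested one, while "USA:s" is kept whole in both cases. Digit-colon-digit sequences such as "12:50" are not affected by this part of the suggestion.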

And change (first one editorial):

U+003A ( : ) COLON (used in Swedish)
to
U+003A ( : ) COLON

and move the colon-like characters from MidLetter to MidNumLet (to handle
numerals like "3:50" as one "word").

UAX 29 text changes (editorial):
Change:

Certain cases such as colons in words (c:a) are included in the default even
though they may be specific to relatively small user communities (Swedish)
because they do not occur otherwise, in normal text, and so do not cause a
problem for other languages.

to

Certain cases such as colons in abbreviated words (e.g., "c:a") and inflections
(e.g., "3:ans", "tv:n") are included in the default even though they may be
specific to relatively small user communities (Swedish and other languages)
because they do not occur otherwise, in normal text, and so do not cause a
problem for languages that do not use this convention.

and

It includes characters that may not be appropriate for identifiers, and some
that would not be parts of words. It also permits some characters that may
be part of words in a broad sense, but not part of names, such as in "c:a" in
Swedish, or hyphenation points used in dictionary words.

to 

It includes characters that may not be appropriate for identifiers, and some
that would not be parts of words. It also permits some characters that may
be part of words in a broad sense, but not part of names, such as in some
abbreviations like "c:a" and some inflections like "USA:s" and "3:e" in Swedish,
or hyphenation points used in dictionary words.

[Consider adding some of the Finnish examples too.]

======================

I would also like to point out that colon is used as a fallback for modifier
letter triangular colon, which may be used in phonetic notation for many
languages.

======================

Jonathan Kew pointed out in an email recently:

It has also been used in other orthographies to represent tone; for an 
example, see "Table 7: Old Tone Orthography for Etung (Cameroon)" in 
[1]. I'm sure that wouldn't be the only example.

ISTM that a "mid-word" colon should be treated similarly to a hyphen or 
apostrophe in the same position.

=======================

Regarding the suggestion to tailor the in-word behaviour of colon for certain
languages (Swedish, Finnish, ...), in particular in CLDR:

Firstly, it does not help when ":" is used as fallback for triangular colon (a
modifier letter).

Secondly, most text is not language tagged. Even though colon inside of
words may be "unexpected" in some languages, it appears that it being
allowed inside a word would only be noticeable for mistypings, e.g. 
"Participants:George, ..." (space after colon is missing). It could possibly be
an issue in languages where space is not used between words, and "western"
punctuation is used. Maybe Thai. On the other hand, those languages need
special handling (like dictionary lookup) for finding word boundaries anyway.

So I wonder which languages are actually hurt by allowing ":" inside words.
None have been exemplified in the 3987 CLDR ticket, nor in L2/12-282.

Date/Time: Fri May 3 15:48:28 CDT 2013
Contact: asmus@unicode.org
Name: asmus
Report Type: Public Review Issue
Opt Subject: objection to changes based on L2/12-282


I second the objections brought by Kent Karlsson under this subject header.

I would like to further point out that colon is used in legal personal names
in Sweden (and possibly in entity names in a wider context).

The problem with names is that they must be supported in databases and other
systems where data do not form single-language "documents" and where language
tagging and language-sensitive processing are not performed on a per-field
basis.

In the European context, databases with names from multiple countries are a
common use case. Database reports, including mail merge, would easily insert a
"Swedish" name into a document that is otherwise not Swedish.

I feel that having the default algorithm fail at lists of names or documents
that contain names would be suboptimal, and that deferring this to
tailoring on the basis of language is a non-starter.

Feedback above this line was considered at the May 2013 UTC meeting.

Ed Note: The linebreak properties of U+3035 have been changed in 6.3

Date/Time: Fri May 10 21:36:28 CDT 2013
Contact: fantasai@inkedblade.net
Name: fantasai
Report Type: Error Report
Opt Subject: Split kana repeat mark grapheme cluster


Hi! The CSSWG has received an issue report about disallowing 
letter-spacing/justification between U+3033 and U+3034/U+3035. 
I believe this is actually an error in the Unicode spec--the pair 
should form a single grapheme cluster.

See http://lists.w3.org/Archives/Public/www-style/2013Jan/0071.html
and http://lists.w3.org/Archives/Public/www-style/2013May/0282.html

See also feedback from frommail@badral.net regarding NARROW_NO_BREAK_SPACE (202F) on the 6.3 beta PRI feedback page.