L2/07-250

Comments on Public Review Issues
(May 14, 2007 - August 02, 2007)

The sections below contain comments received on the open Public Review Issues as of August 2, 2007, since the previous cumulative document was issued prior to UTC #111 (May 2007).

Contents:

102 Proposed Update to UAX #15: Unicode Normalization Forms
103 Proposed Update to UAX #29: Text Boundaries
104 Proposed Update to UAX #31: Identifier and Pattern Syntax
105 Proposed Update to UAX #14: Line Breaking Properties
107 Script Property Values for some characters in U+3200..U+33FF
Other Reports
Closed Public Review Issues


102 Proposed Update to UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

103 Proposed Update to UAX #29: Text Boundaries

Date/Time: Wed Jul 25 07:49:59 CDT 2007
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: UAX29-12 text boundaries: the case of apostrophes

UAX#29 includes an optional rule for handling apostrophes, but it discusses only the case of French and Italian, where the apostrophe marks the elision of letters and the contraction of two words:

Break between apostrophe and vowels (French, Italian). WB5a. apostrophe ÷ vowels

Note that this case also occurs in English. In "It's a pity...", the words are effectively "it" and a contraction of the verb "is" with its leading letter elided. However, it is more ambiguous in English because "'s" is also used as a genitive suffix, as in "Bob's friend", or sometimes as a plural suffix. The genitive suffix is often contracted once more by dropping the "s" and keeping only the apostrophe after a word ending in "s" or "sh", as in "Tess' friends"; in a plural, the apostrophe will often be replaced by an "e".

But the rule only allows the ASCII apostrophe (U+0027) and the right curly apostrophe (or right single quotation mark, U+2019). There is also the case where the apostrophe is used as a glottal mark when transcribing, for example, Polynesian languages (like Tahitian).

For example, the name of the city of "Faa‘a" should preferably use a left curly apostrophe, i.e. U+2018 (or sometimes a more technical character, almost never seen), but it is commonly written with one of the two other apostrophes. In the official French IGN toponymy and in INSEE administrative division names, this Polynesian glottal letter is most simply omitted, producing just "Faaa". The effective choice of character is most often made on typographical considerations alone; the characters are recognized as equivalent in these languages.

Is there a way to include this U+2018 (left single 6-shaped quotation mark) as another possible encoding for this apostrophe character?

I am not suggesting adding the prime symbols, or the spacing acute and grave accents, because they are perceived as wrong (although they may be easily confused, or present due to the initial usage of limited legacy charsets).
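
The effect of the request above can be illustrated with a minimal sketch. It is not the UAX #29 algorithm: the vowel class is a deliberately simplified, hypothetical stand-in, and the names `APOSTROPHES_CURRENT`, `APOSTROPHES_PROPOSED` and `wb5a_breaks` are invented for illustration. It only shows how adding U+2018 to the apostrophe set changes where an "apostrophe ÷ vowels" rule would apply.

```python
# Illustrative sketch only; not the UAX #29 word boundary algorithm.
# Apostrophes accepted by the optional rule as described in the comment.
APOSTROPHES_CURRENT = {"\u0027", "\u2019"}
# Hypothetical extended set that also accepts U+2018, as the comment asks.
APOSTROPHES_PROPOSED = APOSTROPHES_CURRENT | {"\u2018"}
# Assumption: a toy vowel class, far smaller than any real tailoring.
VOWELS = set("aeiouAEIOU")


def wb5a_breaks(text, apostrophes):
    """Return offsets i where text[i-1] is an apostrophe and text[i] is a vowel."""
    return [
        i
        for i in range(1, len(text))
        if text[i - 1] in apostrophes and text[i] in VOWELS
    ]


print(wb5a_breaks("l'ami", APOSTROPHES_CURRENT))
print(wb5a_breaks("Faa\u2018a", APOSTROPHES_CURRENT))
print(wb5a_breaks("Faa\u2018a", APOSTROPHES_PROPOSED))
```

With the current set the rule never fires on "Faa‘a" written with U+2018; with the extended set it treats that character like the other two apostrophes.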

104 Proposed Update to UAX #31: Identifier and Pattern Syntax

No feedback was received via the reporting form this period.

105 Proposed Update to UAX #14: Line Breaking Properties

Date/Time: Wed Jul 25 07:05:11 CDT 2007
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: UAX14-20 undesirable line breaking opportunities (parentheses, quotation marks)

The line breaking opportunities do not seem to handle some special cases where undesirable line breaks are currently allowed. This happens, for example, with parentheses, which currently always allow line breaks before or after them and the text they surround.

I can cite an example from the officially documented French toponyms: "Château-Chinon(Ville)" and "Château-Chinon(Campagne)", which designate two distinct French communes and each form a single compound name. The INSEE officially writes them WITHOUT a space separator (so the term within parentheses is not a common word but part of the toponym, and it takes a mandatory capital).

In this case, allowing a line break before the opening parenthesis would allow a rendering where the line break, if inserted, would be interpreted as if there were a space, and the required capital on the term "Ville" or "Campagne" between parentheses would look like a typo.

Note the difference with the French names of a few cantons that are *qualified* by adding " (ville)" or " (campagne)" with a space separator and no capital for the specifier (this occurs for example in the canton and arrondissement around the French city (toponym) of "Strasbourg"). The generated name does NOT create a compound name.

Note the difference with toponyms (or other proper names) that would be otherwise written as "...-Ville" or "...-Campagne": in this case the linebreak is possible after the hyphen, which remains when a line break occurs and still explicitly marks that this is a compound name.

For strange reasons, the INSEE reference for French administrative units (and the IGN, for its official toponyms) have used parentheses instead of a hyphen.

How to handle this case, in a way so that parentheses will not allow a linebreak on BOTH sides of parentheses if they are surrounded by parentheses?

I can give another, more common example where such linebreaks are undesirable: "un (ou plusieurs) mot(s)". Note how the "s" plural mark in "mots" is marked as an alternative; it is not separable from the word it normally completes. Inserting a linebreak between "mot" and "(s)" would be wrong.

Another example arises when writing math formulas such as "f(x) = x + 2". Here again, the term "f(x)" should remain unbreakable. The same should apply to the term "f[x]" in "f[x] = x + 2".

I propose disallowing line breaks around ***BOTH*** sides of:
* (parentheses), or parenthesis-like characters like
* [square brackets],
* ‹angle brackets or quotation marks› (we can accept it for less-than and greater-than signs), or even
* “double 6/9 quotation marks”, or
* «double angle quotation marks», or
* ‘single 6/9 quotation marks’,
if and only if the characters on each side of the marks would be unbreakable in the absence of these marks.

This will also cover the case where ‘single 6/9 quotation marks’ are used as apostrophes (common in French and English to mark the elision of letters or some abbreviated words) or as reversed apostrophes (used in Polynesian languages as a glottal consonantal mark).
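
The proposal above can be sketched as follows, under loudly stated assumptions. This is not the UAX #14 algorithm: `base_can_break` is a hypothetical toy baseline that allows a break only after a space (hyphens and all other classes are ignored), and the mark sets and function names are invented for illustration. The sketch suppresses breaks on both sides of a mark when the characters around it would be unbreakable with the marks removed, and otherwise allows a break only on the outer side of the mark.

```python
# Illustrative sketch only; a toy model, not the UAX #14 pair table.
MARKS_OPEN = set("([\u2039\u201C\u00AB\u2018")
MARKS_CLOSE = set(")]\u203A\u201D\u00BB\u2019")
MARKS = MARKS_OPEN | MARKS_CLOSE


def base_can_break(before, after):
    """Assumed toy baseline: a break is allowed only after a space."""
    return before == " " and after != " "


def _nearest(text, start, step):
    """Nearest non-mark character from `start`, walking by `step`; None if none."""
    i = start
    while 0 <= i < len(text) and text[i] in MARKS:
        i += step
    return text[i] if 0 <= i < len(text) else None


def proposed_breaks(text):
    """Offsets i where a break before text[i] is allowed under the sketch."""
    breaks = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        if before not in MARKS and after not in MARKS:
            if base_can_break(before, after):
                breaks.append(i)
            continue
        # A mark is involved: look through the marks at the real neighbours.
        left = _nearest(text, i - 1, -1)
        right = _nearest(text, i, +1)
        if left is None or right is None or not base_can_break(left, right):
            continue  # unbreakable without the marks, so unbreakable with them
        # Otherwise break only on the outer side of the mark.
        outside_opener = after in MARKS_OPEN and before not in MARKS_OPEN
        outside_closer = before in MARKS_CLOSE and after not in MARKS_CLOSE
        if outside_opener or outside_closer:
            breaks.append(i)
    return breaks


print(proposed_breaks("un (ou plusieurs) mot(s)"))
print(proposed_breaks("f(x) = x + 2"))
```

In this toy model "mot(s)" and "f(x)" yield no internal break opportunities, while " (ou plusieurs) " can still break before its opening parenthesis because a space precedes it.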

Date/Time: Wed Jul 25 07:19:05 CDT 2007
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: UAX14-20 undesirable line breaking opportunities (parentheses, quotation marks)

As an alternative to my proposal, parentheses or quotation marks could also be described by making them inherit the line breaking opportunity property from the character they immediately surround, while keeping the prohibition of linebreaking between the parenthesis/quotation mark and the inner character it touches.

This would correctly handle the case of parentheses used to surround ideographs, because parentheses should not be detached from the inner ideograph, although they should still remain breakable from the outer character, as if the parentheses were absent from the text. In other words, the line break rule would treat parentheses and quotation marks as if they were diacritics, part of a larger unbreakable grapheme cluster with the inner character used as the effective base; line breaking would be analyzed by first ignoring them and just checking the breaking opportunities between the inner and outer characters.

Date/Time: Wed Jul 25 15:37:03 CDT 2007
Contact: kenw@sybase.com
Name: Ken Whistler
Report Type: Public Review Issue
Opt Subject: UAX #14 Proposed Update feedback

Two minor editorial issues in the document of note:

1. The sentence just before LB13 states that the rules about opening and closing "...have special behavior with respect to spaces, and therefore come before Rule 19." However, the space rule is actually Rule LB*18*, not 19. This problem was introduced in the 5.0.0 version of UAX #14, when rules were renumbered. The text was correct in the 4.1.0 version, when the text referred to Rule 12, which at that point was in fact the space rule.

2. Rule LB16 has a problem in the example quoted in the rule. It states that one should not break between ']h', but the "h" is clearly a mistake for something else, and should be a representative character from the lb=NS class. This typo has been in the text for a number of versions now as well.

107 Script Property Values for some characters in U+3200..U+33FF

Date/Time: Wed Jul 4 01:24:25 CDT 2007
Contact: cfynn@gmx.net
Name: Chris Fynn
Report Type: Public Review Issue
Opt Subject: PR 107

Circled numbers such as those at U+3251 to U+325F; U+32B1 to U+32BF (and those at U+2460 to U+246F - as well as most other characters in the "Enclosed Alphanumerics" block); "CJK Angle Brackets" U+3008 - U+300B; CJK Corner Brackets U+300C - U+300F; CJK Brackets U+3010, U+3011, U+3014 - U+301B; CJK Symbols & Punctuation U+301D, U+301E; and a number of other symbols frequently occur in modern Tibetan publications.

Modern Tibetan (and other "minority" language) documents published in the PRC are created using software and typesetting systems originally and principally designed for creating Chinese (Han) publications. Operators are generally users of Chinese as well as their own "minority" language - so when additional symbols, punctuation, and so on have been adopted for modern publications, they have naturally used those readily available on their typesetting systems and familiar from Chinese language publications.

Hence when the properties of *any* CJK Punctuation and Symbols used in China are being considered I think it would be wise to carefully investigate whether these characters are also being used in modern "minority language" publications & data from China written in non-CJK scripts (such as Tibetan) - and, where necessary, the processing requirements for situations where these characters are used in conjunction with those scripts should also be taken into account.

- Chris

Other Reports

Date/Time: Sun Jul 22 09:16:01 CDT 2007
Contact: msd@pobox.com
Name: Michael D'Errico
Report Type: Error Report
Opt Subject: UTS#18 Regular Expressions

In UTS#18, the following statement is made about the inverse of a set containing literal clusters:

"A typical implementation of the inverse of a set containing literal clusters simply removes those strings, thus [^a-z ñ \q{ch} \q{ll} \q{rr}] is equivalent to [^a-z ñ]."

I think that this is bad advice. In my implementation, this is not the case. Consider [^\q{ch}\q{ll}\q{rr}] -- if the literal clusters are simply removed, then this set will be empty and therefore match anything. However, it should not match at the beginning of the string "ch" for example (though it can match at the 'h').

The way I implemented this is I simply perform a match using the set (without inverse) and negate the result. Thus the example I gave does not match the beginning of the string "ch".

Mike
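
The difference described in the report above can be illustrated with a minimal sketch. It is neither the UTS #18 text nor the reporter's implementation: the set is modelled as a collection of single characters plus a list of literal strings, and the function names are hypothetical. It compares inverting a set by removing its strings with inverting it by negating the result of the positive match.

```python
# Illustrative sketch only; a toy model of a set with literal clusters.

def set_matches_at(chars, strings, text, pos):
    """True if any member of the (non-inverted) set matches text at pos."""
    if pos >= len(text):
        return False
    if text[pos] in chars:
        return True
    return any(text.startswith(s, pos) for s in strings)


def inverse_by_removal(chars, strings, text, pos):
    """Inverse that drops the strings and negates the single characters only."""
    return pos < len(text) and text[pos] not in chars


def inverse_by_negation(chars, strings, text, pos):
    """Inverse that negates the result of matching the full positive set."""
    return pos < len(text) and not set_matches_at(chars, strings, text, pos)


clusters = ["ch", "ll", "rr"]  # stands in for [^\q{ch}\q{ll}\q{rr}]
print(inverse_by_removal(set(), clusters, "ch", 0))
print(inverse_by_negation(set(), clusters, "ch", 0))
print(inverse_by_negation(set(), clusters, "ch", 1))
```

Under removal, the inverted set matches at the start of "ch"; under negation it does not match there but does match at the "h", which is the behaviour the report describes.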

Closed Public Review Issues

No feedback was received via the reporting form this period.