Comments on Public Review Issues - October 13, 2007

L2/07-350

Comments on Public Review Issues
(August 02, 2007 - October 13, 2007)

The sections below contain comments received on the open Public Review Issues as of October 13, 2007, since the previous cumulative document was issued prior to UTC #112 (August 2007).

Contents:

102 Proposed Update to UAX #15: Unicode Normalization Forms
103 Proposed Update to UAX #29: Text Boundaries
104 Proposed Update to UAX #31: Identifier and Pattern Syntax
105 Proposed Update to UAX #14: Line Breaking Properties
108 Ideographic Variation Database Submission
109 Proposed Draft UTR #42: An XML representation of the UCD
110 Proposed Update to UAX #24 Script Names
111 Proposed Update to UTS #18 Unicode Regular Expressions
112 Proposed Update to UAX #9 Unicode Bidirectional Algorithm
113 Proposed Update to UTS #10 Unicode Collation Algorithm
114 Proposed Update to UAX #34 Unicode Named Character Sequences
Other Reports
Closed Public Review Issues

102 Proposed Update to UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

103 Proposed Update to UAX #29: Text Boundaries

No feedback was received via the reporting form this period.

104 Proposed Update to UAX #31: Identifier and Pattern Syntax

Date/Time: Wed Sep 12 07:09:40 CDT 2007
Contact: naa.ganesan@gmail.com
Name: Naga Ganesan
Report Type: Public Review Issue
Opt Subject: Feedback on PR-104

A Tamil example, like the Farsi one (Figure 2), is given in: indology2.googlepages.com/ZWJ_semantics.pdf

Please add a Tamil example in Figure 2:
(i) Without ZWNJ
பக்ஷி 'bird'
<0BAA 0B95 0BCD 0BB7 0BBF>
(ii) With ZWNJ
பக்‍ஷி 'Name of a Muslim person'
<0BAA 0B95 0BCD 200C 0BB7 0BBF>
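The two code point sequences above differ only by U+200C. A minimal Python sketch of that difference follows; the helper name `from_hex` is hypothetical, and the sketch says nothing about how either string is rendered.

```python
# Code point sequences copied from the comment above.
WITHOUT_ZWNJ = "0BAA 0B95 0BCD 0BB7 0BBF"
WITH_ZWNJ = "0BAA 0B95 0BCD 200C 0BB7 0BBF"

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def from_hex(sequence):
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in sequence.split())


plain = from_hex(WITHOUT_ZWNJ)
with_zwnj = from_hex(WITH_ZWNJ)

# A comparison that keeps ZWNJ can tell the two words apart.
differ = plain != with_zwnj

# Dropping ZWNJ, as a syntax that ignores it would do,
# makes the two spellings collide.
collide = with_zwnj.replace(ZWNJ, "") == plain
```

Both `differ` and `collide` are true here, which is the situation the comment describes: the distinction survives only if ZWNJ is kept.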

Whether or not ZWJ is present substantially changes word meanings in languages like Marathi, Konkani, Nepali and Newari. For examples where the presence of ZWJ produces substantial differences in meaning, please refer to http://indology2.googlepages.com/ZWJ_semantics.pdf

Tamil example like the Malayalam example in Figure 3.

European (Christian) and Middle Eastern (Islamic) loan words into Tamil are written with ZWNJ.

For Figure 3, it is stated: "The Malayalam word for eyewitness. The form without the ZWNJ is incorrect in this case."

Please add for Tamil, "The Tamil word for 'section' is செக்‍ஷன் < 0B9A 0BC6 0B95 0BCD 200C 0BB7 0BA9 0BCD>. The form without the ZWNJ is incorrect in this case.". Similarly, Muslim names without ZWNJ whenever க்‍ஷ occurs is not correct.

Thanks for adding these cases in UAX #31,

N. Ganesan

105 Proposed Update to UAX #14: Line Breaking Properties

No feedback was received via the reporting form this period.

108 Ideographic Variation Database Submission

No feedback was received via the reporting form this period.

109 Proposed Draft UTR #42: An XML representation of the UCD

No feedback was received via the reporting form this period.

110 Proposed Update to UAX #24 Script Names

No feedback was received via the reporting form this period.

111 Proposed Update to UTS #18 Unicode Regular Expressions

Date/Time: Wed Sep 19 20:45:46 CDT 2007
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: TR18-12: start/end of line in RL1.6

Some uncorrected items in RL1.6 are not clear enough:

[quote]
Logical beginning of line (often "^")
(...)
* There is no empty line within the sequence \u000D\u000A.
[/quote]

[quote]
Logical end of line (often "$")
(...)
* There is no empty line within the sequence \u000D\u000A.
[/quote]

The repeated item suggests something wrong because, even in multiline mode, a file (or complete text) that contains only CRLF would still have one empty line, with a start of line and an end of line:
* when not in multiline mode, they are at the same position, just before the CRLF sequence, which is considered not as a single character (matched by ".") but as a line separator.
* in multiline mode, there is a start of line before the CRLF sequence and an end of line after the sequence.

I think that what is really intended here in these two items is:
* There is no empty line in the middle of the sequence \u000D\u000A, i.e. between the first and second character.

When NOT in multiline mode, EVERY occurrence of a CRLF sequence implies an end of line before the sequence and a start of line after the sequence, if the sequence is not at the end of the file/text.

When in multiline mode, a CRLF sequence is treated as if it were a single character, but this does not invalidate the existence of exactly one start of line (just before the first character of the text, even if this one is CR, part of a CRLF sequence) and exactly one end of line (just after the last character of the text, even if this one is LF, part of a CRLF sequence).

Even for a completely empty file, in multiline mode, the file contains a start of line and an end of line at the same position (the "start of line" and "end of line" in multiline mode actually mean "start of file" and "end of file"). This means that a search regexp pattern like "^$" in multiline mode will find a single match for empty files, and the regexps "^" or "$" will find matches in ALL files (whether or not they have any content).
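One way to read the interpretation proposed in this comment is as a rule for where line starts and line ends may fall when CRLF is an indivisible separator. The Python sketch below works under that assumption. The function name `line_boundaries` and the chosen set of separators are illustrative only and are not taken from UTS #18.

```python
import re

# CRLF is listed first so that it is consumed as a single separator.
# The remaining separators are an assumption made for this sketch.
LINE_SEP = re.compile(r"\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def line_boundaries(text):
    """Return (starts, ends): the offsets where a line starts or ends."""
    starts = [0]            # the text always opens a line, even if empty
    ends = []
    for sep in LINE_SEP.finditer(text):
        ends.append(sep.start())        # a line ends just before a separator
        if sep.end() < len(text):       # no new line after a final separator
            starts.append(sep.end())
    ends.append(len(text))  # the text always closes a line
    return starts, ends
```

For the text "a\r\nb" this gives starts at offsets 0 and 3 and ends at offsets 1 and 4, so offset 2 (between CR and LF) is never a boundary. For an empty text it gives one start and one end, both at offset 0.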

Date/Time: Thu Sep 20 15:53:15 CDT 2007
Contact: msd@pobox.com
Name: Michael D'Errico
Report Type: Public Review Issue
Opt Subject: Issue #111 Proposed Update to UTS #18

Regarding the proposed update:

http://www.unicode.org/reports/tr18/tr18-12.html

I disagree with the MUSTs in the proposed text. In my implementation, whether "." matches newline sequences is independent of "multiline mode." Multiline mode affects the behavior of ^ and $, not .; in single line mode, they match only at the beginning or end of the text (or just before a final newline sequence); in multiline mode, ^ matches at the beginning of the string or after any newline sequence, and $ matches before any newline sequence or at the end of the string.

You can turn on the DotMatchesNewline and MultilineMatching options separately. As a side note, I implemented "." to match a default grapheme cluster, so A + ACUTE is treated as a single entity, and Hangul syllables are kept together (you can also specify them using \L+\V+\T* if you want). There is also a DotMatchesDefective option (true by default) which determines whether . will match a defective combining character sequence (or you can look specifically for defective sequences using \F).
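The independence of the two options can be illustrated with Python's `re` module, whose `re.MULTILINE` and `re.DOTALL` flags are likewise separate. This is only an analogy: it is not the commenter's implementation, and Python's `.` matches a single code point rather than a grapheme cluster. The helper names are hypothetical.

```python
import re

TEXT = "ab\ncd"


def anchored_lines(flags=0):
    """Lines found by ^\\w+$ under the given flags."""
    return re.findall(r"^\w+$", TEXT, flags)


def dot_crosses_newline(flags=0):
    """True if '.' can match the newline between 'b' and 'c'."""
    return re.search(r"b.c", TEXT, flags) is not None


# Each flag changes only its own behaviour:
results = {
    "neither": (anchored_lines(), dot_crosses_newline()),
    "multiline": (anchored_lines(re.MULTILINE),
                  dot_crosses_newline(re.MULTILINE)),
    "dotall": (anchored_lines(re.DOTALL),
               dot_crosses_newline(re.DOTALL)),
    "both": (anchored_lines(re.MULTILINE | re.DOTALL),
             dot_crosses_newline(re.MULTILINE | re.DOTALL)),
}
```

Only the multiline flag changes what `^` and `$` find, and only the dot-all flag changes whether `.` matches the newline.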

Other regular expression features you may be interested in:
  \c    any code point (what you use . for)
  \n    any of the newline sequences including CRLF
        (and \r has no meaning)
  \F    defective combining character sequence
  \g    default grapheme cluster boundary
  \G    complement of \g
  \h    hex digit
  \a    assigned code point
  \A    unassigned code point
  \i    CJK ideograph
  \I    Unified ideograph
  \K    Katakana
  \H    Hiragana
  \m    combining characters (equivalent to \p{M})
  \M    complement of \m
  \L    leading jamo
  \V    vowel jamo
  \T    trailing jamo
\v{version} specifies which version of Unicode to use for character properties, e.g. /\v{4.1}\A+/

\p and \P are similar to what you have defined, but I allow multiple values: \p{gc=L|M|N} and in some cases comparisons: \p{Numeric_Value>=10}, \p{ccc<230}
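The comparison forms can be approximated outside a regular expression engine as plain predicates. The sketch below uses Python's standard `unicodedata` module; the function names are hypothetical, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


# Rough analogue of \p{ccc<LIMIT}: canonical combining class below LIMIT.
# Note that this also accepts every character whose class is 0.
def ccc_less_than(ch, limit):
    return unicodedata.combining(ch) < limit


# Rough analogue of \p{Numeric_Value>=MINIMUM}.
# Characters without a numeric value never match.
def numeric_value_at_least(ch, minimum):
    value = unicodedata.numeric(ch, None)
    return value is not None and value >= minimum


cedilla_below_230 = ccc_less_than("\u0327", 230)      # COMBINING CEDILLA
acute_below_230 = ccc_less_than("\u0301", 230)        # COMBINING ACUTE ACCENT
roman_ten_ge_10 = numeric_value_at_least("\u2169", 10)  # ROMAN NUMERAL TEN
digit_five_ge_10 = numeric_value_at_least("5", 10)
```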

\u and \U are the same except I got rid of the two extra leading zeros in \U (since a code point is always representable in 24 bits)

\N{name} works with character names and also named character sequences

Mike

Date/Time: Tue Sep 25 12:32:01 CDT 2007
Contact: msd@pobox.com
Name: Michael D'Errico
Report Type: Technical Report or Tech Note issues
Opt Subject: Named character sequences in regular expressions using \N

Mark Davis asked me to report this idea using this form.

In my regular expression code, I allow you to use the \N syntax to specify a named character sequence consisting of multiple code points:

e.g. \N{KATAKANA LETTER AINU P}

This can be used anywhere \N with a character name is allowed. Repetition operators such as * + or {m,n} apply to the sequence of code points, not just the last code point. Also, when placed in a character class, it is equivalent to [\q{\u31F7\u309A}].
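The point about repetition operators can be shown with Python's `re` module, which has no syntax for named sequences. In this sketch the sequence is therefore spelled out with the two code points given above and grouped by hand; it is an emulation, not the commenter's implementation.

```python
import re

# Code points given in the comment for KATAKANA LETTER AINU P.
AINU_P = "\u31F7\u309A"

# Naive textual substitution: "+" binds only to the last code point.
naive = re.compile(AINU_P + "+")

# Grouping the sequence makes "+" apply to the whole sequence,
# which is the behaviour described for \N{...} above.
grouped = re.compile("(?:" + re.escape(AINU_P) + ")+")

text = AINU_P * 3

naive_full = naive.fullmatch(text) is not None
grouped_full = grouped.fullmatch(text) is not None
naive_prefix = naive.match(text).group()
```

Here `grouped_full` is true while `naive_full` is false: without grouping, the pattern matches only the first occurrence of the sequence.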

Mike

112 Proposed Update to UAX #9 Unicode Bidirectional Algorithm

No feedback was received via the reporting form this period.

113 Proposed Update to UTS #10 Unicode Collation Algorithm

Date/Time: Fri Oct 12 18:06:05 CDT 2007
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UTS10/UTS18 interactions (matching characters)

For the record, here is a copy of two mails exchanged on the Unicode mailing list, related to the interaction of two UTSes under review. It may be too late to add this comment to your agenda, but the point is still important for understanding the issue, because a decision on UTS18 could affect the way UTS10 is implemented:

------------------

On October 10, I (Philippe Verdy <verdy_p@wanadoo.fr>) wrote:

I hope that the two UTS updates whose review closes today will be considered jointly, because they do interact:
- UTS 10 (UCA): for matching characters (see section 8).
- UTS 18 (Regular expressions): currently very weak at defining the level of matching (characters? Default grapheme clusters like in UTS 10? Or collation elements, the most generic case, which contains both definitions, depending only on the tailoring of locales).

This has caused lots of different interpretations and discussions here. If UTS 18 is too restrictive, it may break UTS 10 rules for matching strings (independently of the advanced string matching that UTS 18 provides for alternations and additional conditions, so that regexp searches should still be an extension of UTS 10 for string matching, permitting everything that UTS 10 already provides).

---

Mark Davis replied on October 11:

Note that the closing dates are to allow feedback before the subsequent Unicode Technical Committee meeting. The committee will consider the public feedback that it has received, and the comments from the membership. It doesn't have to advance a proposed update if there are still open issues.

The UAXes do have a timetable, since they have to be ready for Unicode 5.1. However, they will have one more round before they go final, at the UTC meeting early next year. The later they are in the cycle the harder they are to change, so the earlier the feedback is in, the better.

So, if you think there are problems in the text, then you should file a public review form with your concerns. The most effective feedback states clearly -- and concisely -- what each problem is, and if possible, what a recommended change to the text to fix the problem would be.

The review period date for #10 was announced as
> Due date for comments to the current draft is: 2007/10/16.
so the date on www.unicode.org/review/ is incorrect and will need to be fixed. And certainly if there are issues that depend on the interaction with documents, those can be filed as well.

Mark

114 Proposed Update to UAX #34 Unicode Named Character Sequences

No feedback was received via the reporting form this period.

Other Reports

Date/Time: Wed Sep 26 01:17:58 CDT 2007
Contact: frank_r_schaefer@gmx.net
Name: Frank-Rene Schaefer
Report Type: Error Report
Opt Subject: Unicode Database Properties

Concern: Unicode Database 5.0.

Dear Ladies and Gentlemen.

The file 'DerivedNormalizationProps.txt' contains properties and property aliases (example: "NFD_QC" is an alias, 'Expands_On_NFKD' is a property name).

Parsing for binary properties becomes cumbersome, since one needs to check for every property whether it is an alias or a property name. Please let all files containing binary properties use either aliases or names, but not a mix of both.
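Until the files are made consistent, a parser can work around the mix by mapping every name it reads to one canonical form. The Python sketch below assumes an alias table in the "short ; long" layout of PropertyAliases.txt. The sample lines are illustrative and should be checked against the real file, the function names are hypothetical, and the matching is simplified to a case-insensitive comparison.

```python
# Illustrative lines in the "short ; long" layout of PropertyAliases.txt.
SAMPLE_ALIASES = """\
NFD_QC   ; NFD_Quick_Check
XO_NFKD  ; Expands_On_NFKD
Comp_Ex  ; Full_Composition_Exclusion
"""


def build_alias_table(lines):
    """Map every alias and long name (lowercased) to the long name."""
    table = {}
    for line in lines.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        names = [field.strip() for field in line.split(";")]
        long_name = names[1]                   # second field is the long name
        for name in names:
            table[name.lower()] = long_name
    return table


def canonical_name(table, name):
    """Return the long property name, or the input if it is unknown."""
    return table.get(name.lower(), name)


table = build_alias_table(SAMPLE_ALIASES)
from_alias = canonical_name(table, "NFD_QC")
from_long_name = canonical_name(table, "Expands_On_NFKD")
```

With such a table, "NFD_QC" and "Expands_On_NFKD" both resolve to long names, so the rest of the parser no longer needs to know which form a given data file used.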

Best Regards

Frank-Rene Schaefer

Date/Time: Sun Sep 30 19:58:56 CST 2007
Contact: patrick@hapax.qc.ca
Name: Patrick Andries
Report Type: Technical Report or Tech Note issues
Opt Subject: Small typo in www.unicode.org/reports/tr20/

Hello,

In http://www.unicode.org/reports/tr20/ 3.1 Table of Characters not Suitable for use With Markup

I think "U+0340..U+0341 Clones of grave and accent" should read "U+0340..U+0341 Clones of grave and acute accents", or something similar.

Regards,

Patrick

Closed Public Review Issues

No feedback was received via the reporting form this period.