L2/03-290
Subject: feedback on the last call draft of v8 of the UTS#18
From: Jarkko Hietaniemi
Date: 2003-08-25 05:58:37 -0700

1.2 Properties

RL 1.2 Properties

* After the list, maybe change the "Of these, only General Category and Script are not binary." to "Of these, only General Category and Script are multiply valued, the rest are binary." (I think the sentence flows much better in the positive than in the negative.)

* Should the proposed name of the derived age be "DerivedAge" instead of just "Age"?

1.6 Change "End Of Line" to "Line Boundaries" since the section talks of line boundaries, not just the end of line.

* In all items 1-4, why the ordering

    \u2028 | \u2029 | \u000D\u000A | \u000A | \u000C | \u000D | \u0085

  instead of e.g.

    \u000A | \u000C | \u000D\u000A | \u000D | \u0085 | \u2028 | \u2029

  which would be more codepoint-wise and more familiar, leaving the NEL and the Unicode line/para separators at the end?

* In "Logical line number" maybe mention that different implementations may call the first line either line zero or line one.

* In "Logical beginning of line" add a note that there may be a separate pattern for "beginning of text" in multiline mode, which matches only at the beginning of the first line (in Perl, \A).

* Also, change "start of a file or string" to "start of a string". I don't think "files" should be brought in to confuse the issues even more: "strings" are read from "files" or "streams" or "channels" or whatever, and regexp engines should concern themselves simply with "strings".

* In "Logical end of line" add a note pointing out that there are _three_ variants of "end of line". First, please change the description of the described case to read:

  (*) "EOL is at the end of a string, and also immediately preceding a string-ending occurrence of:"

  - "is" added, "a file or" removed (see above for rationale)
  - "any" changed to "a string-ending" because that is really the most common semantics, not "any" (intra-string) newline.

The (*) is the most common semantics for EOL, but there are more:

  (a) EOL matches at the end of the string
  (b) EOL matches before a string-ending newline
  (c) EOL matches before any newline

In other words, the most common semantics is "(a) or (b)". The different EOL modes might be enabled either by different regular expression patterns or by different matching modes. For example, in Perl '$' means "(a) or (b)" in single-line mode (in multi-line mode it also covers (c)), '\Z' means "(a) or (b)" regardless of multi-line mode, and '\z' means "(a) only", that is, match only at the end of the string. The main point is that SOL and EOL are _not_ symmetric, because of the asymmetry caused by multiline matching.

I might have muddled the message; for an absolutely clear explanation see the section 'Anchors and Other "Zero-Width Assertions"' in chapter 3, 'Overview of Regular Expression Features and Flavors', section 'Common Metacharacters and Features', pp 127-8 in my copy of Jeffrey Friedl's "Mastering Regular Expressions", 2nd Edition 2002, O'Reilly and Associates, ISBN 0-596-00289-0. You might also include this book in the References section, since the 2nd edition covers regular expressions in several programming languages/environments and includes Unicode features.

* In the last subitem of "Arbitrary character pattern" maybe add the word "reverse" in the second "the sequence", which makes the difference even easier to spot.

2.3 Change "Default Words" to "Default Word Boundaries" (ditto for "RL2.3 Default Words" immediately following)

* In the "Note" I am not certain of the last sentence: "However, line breaks are not generally relevant to general regular expression engines." Umm, well. If you mean multi-line matching here, I would object to that statement: line breaks are important, especially if one considers all the complications of multi-line matching.
Of course, if by "line-breaking" you really mean the more complex scenarios of UAX#14, then yes, you are right: regular expressions do not care that much about those.

2.5 Name Properties

In the subsection "Individually Named Characters" change

    and "-" (the character "_"

into

    and the hyphen character "-" (the underbar character "_"

since font differences might make them a bit hard to differentiate.

3.1 Tailored Punctuation

Maybe add an example of tailored punctuation?

3.2 Tailored Grapheme Clusters

Maybe add an example of tailored grapheme clusters, e.g. using the Spanish example?

3.3 Change "Tailored Words" to "Tailored Word Boundaries"

Maybe add an example of tailored word boundaries? (It may be hard to find a simple example...)

3.5 Tailored Ranges

Maybe add an example of tailored loose matches, e.g. using the Spanish example?

That concludes my comments on the DRAFT of Version 8 of the UTS #18.
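
Appendix (illustration for the 1.6 comments): the three "end of line" variants (a), (b) and (c) can be seen in a minimal sketch like the one below. It uses Python's re module rather than Perl, so the spellings differ: in Python, '\Z' matches only at the end of the string (what Perl spells '\z'), and the default '$' gives the "(a) or (b)" behaviour. The sample string and the helper name positions() are assumptions made up for this sketch, not anything taken from the UTS #18 draft.

```python
import re

# Sample input (an assumption for illustration): two lines, with a
# string-ending newline at offset 7 and end of string at offset 8.
text = "one\ntwo\n"


def positions(pattern, flags=0):
    """Return the offsets in `text` where a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text, flags)]


# (a) only: end of string. Python spells this \Z; Perl spells it \z.
end_only = positions(r"\Z")

# (a) or (b): end of string, or just before a string-ending newline.
# This is the default '$' in Python (and '$' without /m in Perl).
default_dollar = positions(r"$")

# (a), (b) or (c): with multi-line matching, also before any newline.
multiline_dollar = positions(r"$", re.MULTILINE)

# Start-of-string side: without multi-line matching, '^' and \A agree,
# so there is no counterpart of the (a) versus "(a) or (b)" split.
default_caret = positions(r"^")
start_only = positions(r"\A")

print(end_only)          # [8]
print(default_dollar)    # [7, 8]
print(multiline_dollar)  # [3, 7, 8]
print(default_caret, start_only)  # [0] [0]
```

The end-of-string side has two distinct non-multiline anchors while the start-of-string side has only one, which is the SOL/EOL asymmetry described in the 1.6 comments.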