L2/03-290

Subject: feedback on the last call draft of v8 of the UTS#18
From: Jarkko Hietaniemi <jhi@iki.fi>
Date: 2003-08-25 05:58:37 -0700

1.2 Properties

	RL 1.2 Properties

	* After the list, maybe change the
	"Of these, only General Category and Script are not binary." to
	"Of these, only General Category and Script are multiply valued,
	the rest are binary."  (I think the sentence flows much better
	in positive than in negative.)

	* Should the proposed name of the derived age be "DerivedAge"
	instead of just "Age"?

1.6	Change "End Of Line" to "Line Boundaries" since the section
	talks of line boundaries, not just the end of line.

	* In all of items 1-4, why the ordering

	\u2028 | \u2029 | \u000D\u000A | \u000A | \u000C | \u000D | \u0085

	instead of e.g.

	\u000A | \u000C | \u000D\u000A | \u000D | \u0085  | \u2028 | \u2029

	which would follow code point order and be more familiar, leaving
	NEL and the Unicode line/paragraph separators at the end?
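
	A minimal sketch of the one ordering constraint that does matter,
	using Python's re module purely as an illustration (the variable
	names and the sample text are my own, not from the draft): both
	the draft's order and the code point order keep CRLF ahead of the
	lone CR, and a backtracking engine needs that to treat CRLF as a
	single boundary.

```python
import re

# Alternation in the code point order suggested above. CRLF is listed
# before the lone CR, so a backtracking engine tries the two-character
# sequence first.
CODEPOINT_ORDER = re.compile("\n|\f|\r\n|\r|\x85|\u2028|\u2029")

# Same alternatives, but with the lone CR moved ahead of CRLF
# (hypothetical "wrong" ordering, for contrast only).
CR_FIRST = re.compile("\n|\f|\r|\r\n|\x85|\u2028|\u2029")

text = "one\r\ntwo\u2028three\x85four"

lines_ok = CODEPOINT_ORDER.split(text)   # CRLF counts as one boundary
lines_bad = CR_FIRST.split(text)         # CR and LF count as two boundaries

print(lines_ok)
print(lines_bad)
```

	With the CR-first ordering an empty "line" appears between the CR
	and the LF, so any reordering should preserve CRLF before CR.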

	* In "Logical line number" maybe mention that different
	implementations may call the first line either line zero
	or line one.

	* In "Logical beginning of line" add a note that, in multiline
	mode, there may be a separate pattern for "beginning of text"
	which matches only at the beginning of the first line
	(in Perl, \A).
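
	As an illustration of the distinction, here is a small sketch in
	Python, whose re module also provides \A; the sample text and
	variable names are assumptions made for the example.

```python
import re

text = "alpha\nbeta\ngamma"

# In multiline mode, ^ matches at the start of every logical line.
line_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]

# \A matches only at the very start of the string, whatever the mode.
text_starts = [m.start() for m in re.finditer(r"\A", text, re.MULTILINE)]

print(line_starts)
print(text_starts)
```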

	* Also, change "start of a file or string" to "start of a string".
	I don't think "files" should be brought in to confuse the issues
	even more: "strings" are read from "files" or "streams" or
	"channels" or whatever, and regexp engines should concern themselves
	with simply "strings".

	* In "Logical end of line" add a note pointing out that there
	  are _three_ variants of "end of line".  First, please change
	  the description of the described case to:

		(*) "EOL is at the end of a string, and also immediately
		 preceding a string-ending occurrence of:"

		- "is" added, "a file or" removed (see above for rationale)
		- "any" changed to "a string-ending" because that
		  is really the most common semantics, not "any" (intra-string)
		  newline.

	  The (*) is the most common semantics for EOL, but there are more.

		(a) EOL matches at the end of the string
		(b) EOL matches before string-ending newline
		(c) EOL matches before any newline

	  In other words, the most common semantics is "(a) or (b)".
	  The different EOL modes might be enabled either by different
	  regular expression patterns or by different matching modes.
	  For example, in Perl '$' means "(a) or (b)" in single-line
	  mode, '\Z' means "(a) or (b)" regardless of multi-line mode,
	  and '\z' means "match only at the end of string".

	  The main point being that SOL and EOL are _not_ symmetric
	  because of the asymmetry caused by multiline matching.

	  I might have muddled the message; for an absolutely clear
	  explanation see the section 'Anchors and Other "Zero-Width
	  Assertions"' in chapter 3, 'Overview of Regular Expression
	  Features and Flavors', section 'Common Metacharacters and
	  Features', pp 127-8 in my copy of Jeffrey Friedl's "Mastering
	  Regular Expressions", 2nd Edition 2002, O'Reilly and Associates,
	  ISBN 0-596-00289-0.  You might also include this book in the
	  References section, since the 2nd edition covers regular
	  expressions in several programming languages/environments
	  and includes Unicode features.
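
	  The three EOL variants (a), (b) and (c) can be sketched as
	  follows.  This uses Python's re module rather than Perl, so the
	  spelling differs: Python's \Z matches only at the end of the
	  string, i.e. it corresponds to Perl's '\z', not Perl's '\Z'.
	  The helper name and sample text are my own assumptions.

```python
import re

text = "first\nsecond\n"   # 13 characters, ends with a string-ending newline

def positions(pattern, flags=0):
    """Return the offsets at which a zero-width pattern matches in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# (a) only: end of the string (Python's \Z, Perl's \z).
end_only = positions(r"\Z")

# (a) or (b): end of the string, or before a string-ending newline.
default_dollar = positions(r"$")

# (a) or (c): end of the string, or before any newline.
multiline_dollar = positions(r"$", re.MULTILINE)

print(end_only, default_dollar, multiline_dollar)
```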

	* In the last subitem of "Arbitrary character pattern" maybe
	  add the word "reverse" in the second "the sequence", to make
	  the difference even easier to spot.

2.3 Change "Default Words" to "Default Word Boundaries"
    (Ditto for "RL2.3 Default Words" immediately following)

	* In the "Note" I am not certain of the last sentence:
	  "However, line breaks are not generally relevant to general
	   regular expression engines."  Umm, well.  If you mean here
	  multi-line matching, I would object to that statement: line
	  breaks are important, especially if one considers all the
	  complications of multi-line matching.  Of course, if by
	  "line-breaking" you really mean the more complex scenarios
	  of UAX#14, then yes, you are right, regular expressions
	  do not care that much about those.

2.5 Name Properties

	In the subsection "Individually Named Characters" change
	
		and "-" (the character "_"

	into

		and the hyphen character "-" (the underbar character "_"

	since font differences might make them a bit hard to differentiate.

3.1 Tailored Punctuation

	Maybe add an example of tailored punctuation?

3.2 Tailored Grapheme Clusters

	Maybe add an example of tailored grapheme clusters,
	e.g. using the Spanish example?

3.3 Change "Tailored Words" to "Tailored Word Boundaries"

	Maybe add an example of tailored word boundaries?
	(may be hard to have a simple example...)

3.5 Tailored Ranges

	Maybe add an example of tailored loose matches,
	e.g. using the Spanish example?

That concludes my comments on the DRAFT of Version 8 of the UTS #18.