L2/11-164

Date:	Apr 23 07:15:48 CDT 2011
Name:         C. E. Whitehead (cewcathar@hotmail.com)
Subject:  Feedback on PRI #182: Proposed Update UTS #18, Unicode Regular Expressions

The following comments involve proofreading and thus hardly need to go to the meeting, just to the typist.
I have only one comment that even touches on content; here it is (it also appears again in the list below):

RL2.2 "Examples" 3rd item

"[a-z ñ \q{ch} \q{ll} \q{rr}] Match some lowercase characters in traditional Spanish. "

{ COMMENT:  ?  These are all the lowercase characters there are in Spanish; also what do you mean by
"traditional Spanish"? It's called either standard Spanish or Spanish; there are of course also dialects,
and colloquial Spanish, and "calo," which can mean Chicano or Romani Spanish }

=>

"[a-z ñ \q{ch} \q{ll} \q{rr}] Match lowercase characters in Spanish."

Just proofreading comments from me on

http://unicode.org/reports/tr18/proposed.html :

First, I personally am used to the U+ and u+ syntax . . . but perhaps most people would prefer to change.

0 Introduction 2nd par 3rd bulleted item "Level 3" last sentence 

"However, there is a performance impact to support at this level."

{ COMMENT:  minor word choice:  "to" is not the right preposition as you have an "impact on" something not "to" something }

=>

"However, there is a performance impact on support at this level."

* * *
0.1 "Notation" about the 7th par (not counting the "Note" or lists of items)

"The following notation is defined for use here and in other Unicode documents:
\n As used within regular expressions, expands to the text matching the nth parenthesized
group in regular expression."

{ COMMENT:  missing article "a" before "regular expression" }

=>

"The following notation is defined for use here and in other Unicode documents:
\n As used within regular expressions, expands to the text matching the nth parenthesized
group in a regular expression."
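
For reference, a minimal sketch of what the \n notation describes, using Python's re module (the language and the sample strings are my own choice for illustration, not something the draft uses):

```python
import re

# \1 expands to whatever text the 1st parenthesized group matched.
doubled_letter = re.compile(r"(\w)\1")
match = doubled_letter.search("hello")
doubled = match.group(0) if match else None

# The same idea with a longer group: (a+) must reappear verbatim after the "b".
mirrored = re.fullmatch(r"(a+)b\1", "aabaa")
not_mirrored = re.fullmatch(r"(a+)b\1", "aaba")
```

Here the first search finds the doubled "l" in "hello", and the second pattern only accepts strings where the text after "b" repeats the text before it.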


* * *

0.1 "Notation" last "Note"

"Because any character could occur as a literal in a regular expression, when regular expression
syntax is embedded within other syntax it can be difficult to determine where the end of the regex expression is."

{ COMMENT:  two things; first verb tense; when a sentence is in the present tense then modals if
possible should really be in the present tense; thus to my ear "could" should read "can" here;
second "regex expression" -- don't you mean just "regex"?  That is, isn't "expression" already
included in "regex," since "regex" is an abbreviation for "regular expression"? }

=>

"Because any character can occur as a literal in a regular expression, when regular expression syntax
is embedded within other syntax it can be difficult to determine where the end of the regex is."

* * *

0.2 "Conformance" 2nd par last sentence

"While the API examples generally follow Java style, it is again only for illustration."

{ COMMENT:  the word "this" would be a better choice than "it" here, since "this" would clearly refer back
to following Java style, while "it" could also be read as an impersonal subject; not important. }

=> ?

"While the API examples generally follow Java style, this is again only for illustration."
* * *


1.1.1 Hex Notation and Normalization, 2nd par, 2nd sentence (1st major par however)

"Literal text in regular expressions may be normalized . . .

For example, a regular expression may contain a sequence of literal characters 'u' and grave,
such as the expression [aeiou ` ´ ¨¨] (the last three character being U+0300 ( ` ) COMBINING
GRAVE ACCENT, U+0301 ( ´ ) COMBINING ACUTE ACCENT, and U+0308 ( ¨ ) COMBINING DIAERESIS.


{ COMMENT:  missing closing parenthesis at the end of this. }

=>

". . . 

For example, a regular expression may contain a sequence of literal characters 'u' and grave,
such as the expression [aeiou ` ´ ¨¨] (the last three character being U+0300 ( ` ) COMBINING GRAVE
ACCENT, U+0301 ( ´ ) COMBINING ACUTE ACCENT, and U+0308 ( ¨ ) COMBINING DIAERESIS)."
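
To show why normalization matters for this example, here is a small sketch using Python's standard unicodedata module (Python and the variable names are my own assumptions for illustration; the draft does not prescribe any language). It applies NFC to the character class and checks that 'u' and the combining grave fuse into one code point:

```python
import unicodedata

# The character class from the example: a e i o u, then three combining marks
# (U+0300 grave, U+0301 acute, U+0308 diaeresis) written as escapes.
pattern = "[aeiou\u0300\u0301\u0308]"

# NFC composes 'u' + U+0300 into the single code point U+00F9 (u with grave),
# so the normalized class no longer lists the grave accent on its own.
normalized = unicodedata.normalize("NFC", pattern)

has_precomposed = "\u00f9" in normalized
grave_survives = "\u0300" in normalized
```

After normalization the class is one code point shorter and contains U+00F9 instead of a separate 'u' and grave, which changes what the class matches.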

* * *

1.2 Properties

"See Section 2.7, Full Properties, also UAX #44, Unicode Character Database [UAX44] and
Chapter 4 in the Unicode Standard [Unicode]."

{ COMMENT:  to make it clear that "Chapter 4" is not part of "UAX #44" I would insert a comma after "[UAX44]" }

=>
"See Section 2.7, Full Properties, also UAX #44, Unicode Character Database [UAX44], and Chapter 4 
in the Unicode Standard [Unicode]."


* * *

1.2 Properties  "Note" after 3rd par

"Note:  it may be a useful implementation technique to load the Unicode tables that support properties 
and other features on demand, to avoid unnecessary memory overhead for simple regular expressions that 
do not use those properties."

{ COMMENT:  in many cases, when you have an "in order to" type of clause (such as the clause beginning 
with "to" here) modifying a sentence or phrase, in order to make it clear that the clause modifies the 
entire sentence or phrase, it's best to put that clause first; it's up to you though }

=> ?

"Note:  to avoid unnecessary memory overhead for simple regular expressions that do not use those
 properties, it may be a useful implementation technique to load the Unicode tables that support 
properties and other features on demand." 

* * *
RL1.2a Compatibility Properties General Category Property 3rd Par not counting note


"For more information on the meaning of these values, seeUAX #44, Unicode Character Database [UAX44]."

{ COMMENT: insert white space between "see" and "UAX." }
=>

"For more information on the meaning of these values, see UAX #44, Unicode Character Database [UAX44]."

* * *

RL1.2 "Script Property" last par

"Note, however, that the values for such a property is likely to be extended over time as new 
information is gathered on the use of characters with different scripts."

{ COMMENT: "values" not "property" is the subject of the verb here; thus the verb should not be 
"is" but "are"; this erroneous sentence occurs elsewhere }

=>

"Note, however, that the values for such a property are likely to be extended over time as new 
information is gathered on the use of characters with different scripts."

* * *

RL1.3 Subtraction and Intersection, 1st Par

"To meet this requirement, an implementation shall supply mechanisms for union, intersection and 
set-difference of Unicode sets."

{ COMMENT:  elsewhere when you have three or more items separated by commas you have commas separating 
each item from the others, even the last item, when "and" is used; for consistency you should follow 
this style here and insert a comma between "intersection" and "and" } 

=>

"To meet this requirement, an implementation shall supply mechanisms for union, intersection, and 
set-difference of Unicode sets."

* * *
RL1.3 Subtraction and Intersection, Last Par

"Binding or precedence may vary by regular expression engine, so it is safest to always disambiguate 
using brackets to be sure."


{ COMMENT:  "using brackets to be sure" at the end of this is confusing; it can refer to either what 
you are disambiguating or how you are disambiguating it. }

=> ?

"Binding or precedence may vary by regular expression engine, so it is always safest to use brackets 
to disambiguate."
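
A small sketch of why the grouping matters. Python's built-in re module has no set subtraction or intersection syntax, so plain Python set objects stand in for Unicode sets here (that substitution, and the sample sets, are my own assumptions for illustration):

```python
# Three small character sets standing in for Unicode sets.
letters = set("abcdef")
vowels = set("ae")
early = set("abc")

# The same three operands, grouped two different ways.
difference_first = (letters - vowels) & early
intersection_first = letters - (vowels & early)

# Without brackets, Python happens to bind "-" tighter than "&";
# another engine could just as well choose the opposite.
unbracketed = letters - vowels & early
```

The two bracketed forms give different sets, so the unbracketed form means different things depending on which binding an engine picks.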

* * *

1.5 Simple Loose Matches  1st Par

"Most regular expression engines offer caseless matching as the only loose matching.  If the engine 
does offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII."


{ COMMENT on "does offers" -- these two words do not go together; if you keep the modal "does" then
"offers" needs to be in infinitive form; if you conjugate "offers" then you can't say "does." }
=>

"Most regular expression engines offer caseless matching as the only loose matching.  If the engine does 
offer this, then it needs to account for the large range of cased Unicode characters outside of ASCII."

or => ?

"Most regular expression engines offer caseless matching as the only loose matching.  If the engine offers 
this, then it needs to account for the large range of cased Unicode characters outside of ASCII."

* * *

RL 1.5 "Simple Loose Matches" 3rd Par

"In addition, because of the vagaries of natural language, there are situations where two different 
Unicode characters have the same uppercase or lowercase."

{ COMMENT:  Do you mean the same "uppercase or lowercase form"? }
=> ?

"In addition, because of the vagaries of natural language, there are situations where two different 
Unicode characters have the same uppercase or lowercase form."
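
Two well-known instances of this, sketched in Python (my choice of language and of examples; the draft may have other characters in mind):

```python
import re

# Greek small sigma and final sigma are different characters...
sigma, final_sigma = "\u03c3", "\u03c2"
# ...but both have the same uppercase form, U+03A3.
same_upper = sigma.upper() == final_sigma.upper() == "\u03a3"

# KELVIN SIGN (U+212A) and LATIN CAPITAL LETTER K share the lowercase form "k".
kelvin = "\u212a"
same_lower = kelvin.lower() == "K".lower() == "k"

# A caseless match therefore has to accept the Kelvin sign for the pattern "k".
kelvin_match = re.fullmatch("k", kelvin, re.IGNORECASE)
```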

* * *

RL1.6 "Line Boundaries" Item 2  "Logical beginning of line"


"*SOL is at the start of a file or string, and depending on matching options, also immediately 
following any occurrence of a new line sequence."

{ COMMENT:  word choice?  I would use "occurs" instead of "is" }

=>

"SOL occurs at the start of a file or string, and depending on matching options, also immediately 
following any occurrence of a new line sequence."

* * *

RL1.6 "Line Boundaries" Item 3 "Logical end of line (often "$")"

"*EOL at the end of a file or string, and depending on matching options, also immediately preceding 
a final occurrence of newline sequence."

{ COMMENT:  you elided the verb here; I think you should have a verb; again the verb I like is 
"occurs" even though it sounds a bit repetitive }

=>
"*EOL occurs at the end of a file or string, and depending on matching options, also immediately 
preceding a final occurrence of newline sequence."
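
As an illustration of "depending on matching options", here is a sketch with Python's re module, where the option in question is re.MULTILINE (Python is my own choice here; note also that this module treats only "\n" as a line break, not the full set of newline sequences the draft lists):

```python
import re

text = "one\ntwo\n"

# Default mode: ^ matches only at the start of the string, and $ matches at
# the very end or immediately before a final newline.
starts_default = re.findall(r"^\w+", text)
ends_default = re.findall(r"\w+$", text)

# MULTILINE mode: ^ also matches right after each newline, and $ also
# matches right before each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)
```

In default mode only "one" starts a line and only "two" ends one; with the option set, both words do both.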


* * *
RL2.2 "Extended Grapheme Clusters" 3rd par

"More generally, it is useful to have zero width boundary detections for each of the different 
kinds of segment boundaries defined by Unicode ([UAX29] and [UAX14])."

{ COMMENT:  elsewhere "zero-width boundary detections" is hyphenated as this phrase should be; 
when you have several words conjoined to modify a single noun then you hyphenate }
=>

"More generally, it is useful to have zero-width boundary detections for each of the different
kinds of segment boundaries defined by Unicode ([UAX29] and [UAX14])."

* * *

RL2.2 "Examples" 3rd item

"[a-z ñ \q{ch} \q{ll} \q{rr}] Match some lowercase characters in traditional Spanish. "

{ COMMENT:  ?  These are all the lowercase characters in Spanish; also, what do you mean by "traditional 
Spanish"? It's called standard Spanish or Spanish; there are dialects and colloquial Spanish and "calo," which 
can mean Chicano or Romani Spanish }

=>
"[a-z ñ \q{ch} \q{ll} \q{rr}] Match lowercase characters in Spanish." 
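
To make concrete what this class is meant to match, here is a rough Python sketch. Python's re module does not support the \q{...} notation, so an alternation with the digraphs listed first stands in for it; the function name and test words are hypothetical, chosen only for illustration:

```python
import re

# Digraphs come first so that "ch" is tried before the single letter "c".
# "\u00f1" is the precomposed n with tilde.
SPANISH_LOWER = re.compile("ch|ll|rr|[a-z\u00f1]")

def spanish_units(word):
    """Split a lowercase word into the units the class would match."""
    return SPANISH_LOWER.findall(word)

units_churro = spanish_units("churro")
units_llama = spanish_units("llama")
```

With this stand-in, "churro" splits into four units and "llama" into four, with "ch", "rr", and "ll" each treated as a single unit.
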
* * *

RL2.2.1 Last Par

"The latter will never match in grapheme cluster mode, since it would only match if there were a 
grapheme cluster boundary after the x and if x is followed by \u0308, but that can never happen simultaneously."
{ COMMENT:  verb tense; "would only match" and "if there were" use past tense verb forms 
because you have an unreal condition here; however you've left the next verb in the present tense 
form when it should match the others }

=>
"The latter will never match in grapheme cluster mode, since it would only match if there were a 
grapheme cluster boundary after the x and if x were followed by \u0308, but that can never happen 
simultaneously."

or  { if you do not feel you need an unreal condition here; see http://www.englishpage.com/conditional/presentconditional.html }

=>
"The latter will never match in grapheme cluster mode, since it will only match if there is a 
grapheme cluster boundary after the x and if x is followed by \u0308, but that can never happen 
simultaneously."

* * *

RL2.3 Default Word Boundaries,  "Note" (after 1st 2 pars)

"Note:  Word boundaries and "soft" line break boundaries (where one could break in line wrapping) are 
not generally the same; line breaking has a much more complex set of requirements to meet the 
typographic requirements of different languages."

{ COMMENT:  two things; first "could" is really in the past tense or conditional tense but the 
rest of the sentence is in the simple present; "could" should be "can;"
second having "to meet the typographic requirements of different languages" at the end does not 
sound right; since this clause modifies a whole phrase and not just the word "requirements," this 
clause is much clearer when placed in front of the phrase }

=>


"Note:  Word boundaries and "soft" line break boundaries (where one can break in line wrapping) 
are not generally the same; to meet the typographic requirements of different languages, line 
breaking has a much more complex set of requirements."

* * *

RL2.5 Name Properties  "Individually Named Characters"  par 3


"An implementation may also choose to allow namespaces, where some prefix like "LATIN LETTER" 
is set globally and used if there is no match otherwise."

{ COMMENT: "some" goes with a plural noun but you have "some" before the singular noun "prefix" }
=>

"An implementation may also choose to allow namespaces, where a prefix like "LATIN LETTER" is 
set globally and used if there is no match otherwise."

or =>

"An implementation may also choose to allow namespaces, where some prefixes such as "LATIN 
LETTER" are set globally and used if there is no match otherwise."
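
One way such a namespace could behave, sketched in Python with the standard unicodedata module. The helper name and the prefix "LATIN SMALL LETTER" are hypothetical, picked so that the fallback resolves to real character names; the draft's own prefix and lookup rules may differ:

```python
import unicodedata

def lookup_with_namespace(name, prefix="LATIN SMALL LETTER"):
    """Try the name as given; if there is no match, retry with the prefix."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # No character has this exact name, so fall back to the namespace.
        return unicodedata.lookup(f"{prefix} {name}")

from_prefix = lookup_with_namespace("A")
from_full_name = lookup_with_namespace("GREEK SMALL LETTER ALPHA")
```

The prefix is consulted only when the plain name fails, which matches the "used if there is no match otherwise" wording.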

* * *

RL2.6 Wildcards in Property Values  2nd par; follows "To meet this requirement"

"The regular expression must support at least wildcards; other regular expressions features 
are recommended but optional."

{ COMMENT:  "regular expressions features" ? But, "regular expressions" functions as an adjective 
here, in which case there should be no "s" at the end of "expressions;" => "regular expression";
see "General Category Property" bulleted item following the list of abbreviations where you say 
"regular expression languages" }
=>

"The regular expression must support at least wildcards; other regular expression features are 
recommended but optional."

* * *

RL2.7 Full Properties Last par

"Note, however, that the values for such a property is likely to be extended over time as new 
information is gathered on the use of characters with different scripts."

{ COMMENT:  same error as previously; the verb's subject is "values" not "property" and thus 
"is" should be "are" (since "values" is a plural noun) } 
=>

"Note, however, that the values for such a property are likely to be extended over time as 
new information is gathered on the use of characters with different scripts."


Best,


--C. E. Whitehead

cewcathar@hotmail.com