L2/11-164
Date: Apr 23 07:15:48 CDT 2011
Name: C. E. Whitehead (cewcathar@hotmail.com)
Subject: Feedback on PRI #182: Proposed Update UTS #18, Unicode Regular Expressions
The following comments involve proofreading and thus hardly need to go to the meeting, just to the typist. Only one comment even touches on content; here it is (it also appears in the list below):
RL2.2 "Examples" 3rd item
"[a-z ñ \q{ch} \q{ll} \q{rr}] Match some lowercase characters in traditional Spanish. "
{ COMMENT: ? These are all the lowercase characters there are in Spanish; also, what do you mean by "traditional Spanish"? It's called either standard Spanish or Spanish; there are of course also dialects, and colloquial Spanish, and "calo," which can mean Chicano or Romani Spanish }
=> "[a-z ñ \q{ch} \q{ll} \q{rr}] Match lowercase characters in Spanish."
Just proofreading comments from me on http://unicode.org/reports/tr18/proposed.html :
First, I personally am used to the U+ and u+ syntax . . . but perhaps most people would prefer to change.
0 Introduction 2nd par 3rd bulleted item "Level 3" last sentence
"However, there is a performance impact to support at this level."
{ COMMENT: minor word choice: "to" is not the right preposition, as you have an "impact on" something, not "to" something }
=> "However, there is a performance impact on support at this level."
* * *
0.1 "Notation" about the 7th par (not counting the "Note" or lists of items)
"The following notation is defined for use here and in other Unicode documents: \n As used within regular expressions, expands to the text matching the nth parenthesized group in regular expression."
{ COMMENT: missing article "a" before "regular expression" }
=> "The following notation is defined for use here and in other Unicode documents: \n As used within regular expressions, expands to the text matching the nth parenthesized group in a regular expression."
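To make the \n notation in the item above concrete, here is a minimal sketch in Python, whose `re` module happens to write backreferences the same way (\1, \2, and so on). The pattern and the sample string are assumptions chosen for illustration; they are not taken from UTS #18.

```python
import re

# \1 expands to whatever text the first parenthesized group matched,
# so this pattern finds a word that is immediately repeated.
doubled = re.compile(r"\b(\w+) \1\b")

m = doubled.search("it is is a typo")
matched_text = m.group(0)  # the whole match: the repeated word pair
first_group = m.group(1)   # the text captured by group 1
print(matched_text, "|", first_group)
```

Here group 1 captures "is", and \1 then requires the same text to appear again.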
* * *
0.1 "Notation" last "Note"
"Because any character could occur as a literal in a regular expression, when regular expression syntax is embedded within other syntax it can be difficult to determine where the end of the regex expression is."
{ COMMENT: two things; first, verb tense; when a sentence is in the present tense, then modals, if possible, should really be in the present tense; thus to my ear "could" should read "can" here; second, "regex expression" -- don't you mean just "regex"? That is, is "expression" included in "regex"; is "regex" an abbreviation for "regular expression"? }
=> "Because any character can occur as a literal in a regular expression, when regular expression syntax is embedded within other syntax it can be difficult to determine where the end of the regex is."
* * *
0.2 "Conformance" 2nd par last sentence
"While the API examples generally follow Java style, it is again only for illustration."
{ COMMENT: the word "this" would be a better choice than "it" here, since "this" would clearly refer back to following "Java style", while "it" could also be an impersonal verb; not important. }
=> ? "While the API examples generally follow Java style, this is again only for illustration."
* * *
1.1.1 Hex Notation and Normalization, 2nd par, 2nd sentence (1st major par, however)
"Literal text in regular expressions may be normalized . . . For example, a regular expression may contain a sequence of literal characters 'u' and grave, such as the expression [aeiou ` ´ ¨¨] (the last three character being U+0300 ( ` ) COMBINING GRAVE ACCENT, U+0301 ( ´ ) COMBINING ACUTE ACCENT, and U+0308 ( ¨ ) COMBINING DIAERESIS."
{ COMMENT: missing closing parenthesis at the end of this. }
=> ". . . For example, a regular expression may contain a sequence of literal characters 'u' and grave, such as the expression [aeiou ` ´ ¨¨] (the last three character being U+0300 ( ` ) COMBINING GRAVE ACCENT, U+0301 ( ´ ) COMBINING ACUTE ACCENT, and U+0308 ( ¨ ) COMBINING DIAERESIS)."
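For readers wondering why the quoted passage singles out a character class containing combining marks, the following is a rough sketch in Python. It assumes Python's standard `re` and `unicodedata` modules as a stand-in for a generic engine; the variable names are hypothetical, and nothing here is taken from UTS #18 itself. It shows only that a class lists code points individually, so the precomposed and decomposed spellings of "u with grave" behave differently.

```python
import re
import unicodedata

# "u" followed by U+0300 COMBINING GRAVE ACCENT (decomposed spelling).
decomposed = "u\u0300"
# NFC normalization composes the pair into the single code point U+00F9.
composed = unicodedata.normalize("NFC", decomposed)

# Inside a character class, every code point is a separate member.
char_class = re.compile("[aeiou\u0300\u0301\u0308]")

matches_decomposed = char_class.findall(decomposed)  # matches each code point separately
matches_composed = char_class.findall(composed)      # U+00F9 is not a member of the class
print(matches_decomposed, matches_composed)
```

Under these assumptions, normalizing the text (or the pattern) changes what the class matches, which appears to be the situation the quoted sentence describes.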
* * *
1.2 Properties
"See Section 2.7, Full Properties, also UAX #44, Unicode Character Database [UAX44] and Chapter 4 in the Unicode Standard [Unicode]."
{ COMMENT: to make it clear that "Chapter 4" is not part of "UAX #44" I would insert a comma after "[UAX44]" }
=> "See Section 2.7, Full Properties, also UAX #44, Unicode Character Database [UAX44], and Chapter 4 in the Unicode Standard [Unicode]."
* * *
1.2 Properties "Note" after 3rd par
"Note: it may be a useful implementation technique to load the Unicode tables that support properties and other features on demand, to avoid unnecessary memory overhead for simple regular expressions that do not use those properties."
{ COMMENT: in many cases, when you have an "in order to" type of clause (such as the clause beginning with "to" here) modifying a sentence or phrase, in order to make it clear that the clause modifies the entire sentence or phrase, it's best to put that clause first; it's up to you though }
=> ? "Note: to avoid unnecessary memory overhead for simple regular expressions that do not use those properties, it may be a useful implementation technique to load the Unicode tables that support properties and other features on demand."
* * *
RL1.2a Compatibility Properties General Category Property 3rd Par not counting note
"For more information on the meaning of these values, seeUAX #44, Unicode Character Database [UAX44]."
{ COMMENT: insert white space between "see" and "UAX." }
=> "For more information on the meaning of these values, see UAX #44, Unicode Character Database [UAX44]."
* * *
RL1.2 "Script Property" last par
"Note, however, that the values for such a property is likely to be extended over time as new information is gathered on the use of characters with different scripts."
{ COMMENT: "values" not "property" is the subject of the verb here; thus the verb should not be "is" but "are"; this erroneous sentence occurs elsewhere }
=> "Note, however, that the values for such a property are likely to be extended over time as new information is gathered on the use of characters with different scripts."
* * *
RL1.3 Subtraction and Intersection, 1st Par
"To meet this requirement, an implementation shall supply mechanisms for union, intersection and set-difference of Unicode sets."
{ COMMENT: elsewhere, when you have three or more items separated by commas, you have commas separating each item from the others, even the last item, when "and" is used; for consistency you should follow this style here and insert a comma between "intersection" and "and" }
=> "To meet this requirement, an implementation shall supply mechanisms for union, intersection, and set-difference of Unicode sets."
* * *
RL1.3 Subtraction and Intersection, Last Par
"Binding or precedence may vary by regular expression engine, so it is safest to always disambiguate using brackets to be sure."
{ COMMENT: "using brackets to be sure" at the end of this is confusing; it can refer to either what you are disambiguating or how you are disambiguating it. }
=> ? "Binding or precedence may vary by regular expression engine, so it is always safest to use brackets to disambiguate."
* * *
1.5 Simple Loose Matches 1st Par
"Most regular expression engines offer caseless matching as the only loose matching. If the engine does offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII."
{ COMMENT on "does offers" -- these two words do not go together; if you keep the modal "does" then "offers" needs to be in infinitive form; if you conjugate "offers" then you can't say "does." }
=> "Most regular expression engines offer caseless matching as the only loose matching.
If the engine does offer this, then it needs to account for the large range of cased Unicode characters outside of ASCII."
or
=> ? "Most regular expression engines offer caseless matching as the only loose matching. If the engine offers this, then it needs to account for the large range of cased Unicode characters outside of ASCII."
* * *
RL 1.5 "Simple Loose Matches" 3rd Par
"In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase."
{ COMMENT: Do you mean the same "uppercase or lowercase form"? }
=> ? "In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase form."
* * *
RL1.6 "Line Boundaries" Item 2 "Logical beginning of line"
"*SOL is at the start of a file or string, and depending on matching options, also immediately following any occurrence of a new line sequence."
{ COMMENT: word choice? I would use "occurs" instead of "is" }
=> "SOL occurs at the start of a file or string, and depending on matching options, also immediately following any occurrence of a new line sequence."
* * *
RL1.6 "Line Boundaries" Item 3 "Logical end of line (often "S")"
"*EOL at the end of a file or string, and depending on matching options, also immediately preceding a final occurrence of newline sequence."
{ COMMENT: you elided the verb here; I think you should have a verb; again the verb I like is "occurs" even though it sounds a bit repetitive }
=> "*EOL occurs at the end of a file or string, and depending on matching options, also immediately preceding a final occurrence of newline sequence."
* * *
RL2.2 "Extended Grapheme Clusters" 3rd par
"More generally, it is useful to have zero width boundary detections for each of the different kinds of segment boundaries defined by Unicode ([UAX29] and [UAX14])."
{ COMMENT: elsewhere "zero-width boundary detections" is hyphenated, as this phrase should be; when you have several words conjoined to modify a single noun, then you hyphenate }
=> "More generally, it is useful to have zero-width boundary detections for each of the different kinds of segment boundaries defined by Unicode ([UAX29] and [UAX14])."
* * *
RL2.2 "Examples" 3rd item
"[a-z ñ \q{ch} \q{ll} \q{rr}] Match some lowercase characters in traditional Spanish. "
{ COMMENT: ? These are all the lowercase characters in Spanish; what do you mean by "traditional Spanish"? It's standard Spanish or Spanish; there are dialects and colloquial Spanish and "calo," which can mean Chicano or Romani Spanish }
=> "[a-z ñ \q{ch} \q{ll} \q{rr}] Match lowercase characters in Spanish."
* * *
RL2.2.1 Last Par
"The latter will never match in grapheme cluster mode, since it would only match if there were a grapheme cluster boundary after the x and if x is followed by \u0308, but that can never happen simultaneously."
{ COMMENT: verb tense; you have "would only match" and "if there were", which use past tense verb forms because you have an unreal condition here; however, you've left the next verb in the present tense form when it should match the others }
=> "The latter will never match in grapheme cluster mode, since it would only match if there were a grapheme cluster boundary after the x and if x were followed by \u0308, but that can never happen simultaneously."
or { if you do not feel you need an unreal condition here; see http://www.englishpage.com/conditional/presentconditional.html }
=> "The latter will never match in grapheme cluster mode, since it will only match if there is a grapheme cluster boundary after the x and if x is followed by \u0308, but that can never happen simultaneously."
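The logic of the quoted sentence can be sketched in Python. The pattern it calls "the latter" is not quoted in this comment, so the sketch models only the two conditions the sentence names. The boundary test is a deliberate simplification (an assumption, not the full UAX #29 rules): it treats any character with a nonzero canonical combining class as attaching to the preceding character. Both function names are hypothetical.

```python
import unicodedata

def has_cluster_boundary_after(text, index):
    """Simplified stand-in for UAX #29: there is no boundary after
    text[index] when the next character is a combining mark."""
    nxt = index + 1
    if nxt >= len(text):
        return True  # end of text is always a boundary
    return unicodedata.combining(text[nxt]) == 0

def both_conditions_hold(text, index):
    """True only if there is a cluster boundary after text[index]
    AND text[index] is followed by U+0308 COMBINING DIAERESIS."""
    followed_by_diaeresis = text[index + 1:index + 2] == "\u0308"
    return has_cluster_boundary_after(text, index) and followed_by_diaeresis

# When x is followed by U+0308, no boundary separates them;
# when a boundary follows x, the next character is not U+0308.
for sample in ("x\u0308", "xy", "x"):
    print(repr(sample), both_conditions_hold(sample, 0))
```

Under this simplified model the two conditions are mutually exclusive for every input, which is the point both suggested rewordings are trying to express.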
* * *
RL2.3 Default Word Boundaries, "Note" (after 1st 2 pars)
"Note: Word boundaries and "soft" line break boundaries (where one could break in line wrapping) are not generally the same; line breaking has a much more complex set of requirements to meet the typographic requirements of different languages."
{ COMMENT: two things; first, "could" is really in the past tense or conditional tense, but the rest of the sentence is in the simple present; "could" should be "can;" second, having "to meet the typographic requirements of different languages" at the end does not sound right; since this clause modifies a whole phrase and not just the word "requirements", this clause is much clearer when placed in front of the phrase }
=> "Note: Word boundaries and "soft" line break boundaries (where one can break in line wrapping) are not generally the same; to meet the typographic requirements of different languages, line breaking has a much more complex set of requirements."
* * *
RL2.5 Name Properties "Individually Named Characters" par 3
"An implementation may also choose to allow namespaces, where some prefix like "LATIN LETTER" is set globally and used if there is no match otherwise."
{ COMMENT: "some" goes with a plural noun but you have "some" before the singular noun "prefix" }
=> "An implementation may also choose to allow namespaces, where a prefix like "LATIN LETTER" is set globally and used if there is no match otherwise."
or
=> "An implementation may also choose to allow namespaces, where some prefixes such as "LATIN LETTER" are set globally and used if there is no match otherwise."
* * *
RL2.6 Wildcards in Property Values 2nd par; follows "To meet this requirement"
"The regular expression must support at least wildcards; other regular expressions features are recommended but optional."
{ COMMENT: "regular expressions features" ?
But, "regular expressions" functions as an adjective here, in which case there should be no "s" at the end of "expressions;" => "regular expression"; see "General Category Property" bulleted item following the list of abbreviations where you say "regular expression languages" }
=> "The regular expression must support at least wildcards; other regular expression features are recommended but optional."
* * *
RL2.7 Full Properties Last par
"Note, however, that the values for such a property is likely to be extended over time as new information is gathered on the use of characters with different scripts."
{ COMMENT: same error as previously; the verb's subject is "values" not "property" and thus "is" should be "are" (since "values" is a plural noun) }
=> "Note, however, that the values for such a property are likely to be extended over time as new information is gathered on the use of characters with different scripts."
Best,
--C. E. Whitehead
cewcathar@hotmail.com