Source: Mark Davis
Date: 2007-01-21
Subject: UTS #18 update

I had the action to do:


Mark Davis

Issue a proposed update to UTS #18 (document revision 12) that addresses the issue in L2/05-121.

That document has:

1. Enabling the 'any character' to be newline sequence aware when it is not in a mode that requires it to stop at newline sequences (what I refer to as the default mode) should be optional in the standard.

2. The document could propose some means to match 'any character' while treating newline sequences as a single character but advises against using the standard 'any character' meta character (dot) to achieve this purpose and suggests (mandates?) introducing a new meta character.

I believe that this is already done in the existing version of #18, with the following two paragraphs:

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (e.g. in #1). It would thus be shorthand for
([\u000A\u000B\u000C\u000D\u0085\u2028\u2029] | \u000D\u000A).
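
As an illustration only (not part of UTS #18), the shorthand above can be sketched in Python, whose built-in re module has no "\R" escape. The names LINE_ENDING and find_line_endings are hypothetical. The CRLF alternative is placed first here, on the assumption that a backtracking engine tries alternatives left to right and would otherwise match the CR on its own.

```python
import re

# Hypothetical stand-in for "\R"; Python's re module does not define this escape.
# CRLF comes first so it is consumed as one unit before the single-character set.
LINE_ENDING = r"(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])"


def find_line_endings(text):
    """Return each line ending in text, with CRLF reported as a single match."""
    return re.findall(LINE_ENDING, text)


sample = "a\r\nb\nc\u2028d\re"
endings = find_line_endings(sample)
```

For this sample, the CRLF pair comes back as one two-character match rather than as separate CR and LF matches.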

Note: For some implementations, there may be a performance impact in recognizing CRLF as a single entity, such as with an arbitrary pattern character ("."). To account for that, an implementation may satisfy R1.6 if there is a mechanism available for converting the sequence CRLF to a single line boundary character before regex processing.
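
A minimal sketch of the conversion mechanism that note describes, again in Python and again only as an illustration: CRLF is collapsed to a single LF before any pattern is applied, so an arbitrary pattern character sees one line boundary instead of two. The function name normalize_crlf and the choice of LF as the replacement character are assumptions, not something the text mandates.

```python
import re


def normalize_crlf(text):
    """Collapse each CRLF pair into a single LF before regex processing."""
    return text.replace("\r\n", "\n")


text = "ab\r\ncd"

# Without preprocessing, "." in DOTALL mode matches CR and LF separately.
raw_matches = re.findall(r".", text, flags=re.DOTALL)

# After preprocessing, the line boundary is a single character.
normalized_matches = re.findall(r".", normalize_crlf(text), flags=re.DOTALL)
```

The regex engine itself is unchanged in this approach; only the input is rewritten, which is why it avoids the performance cost mentioned in the note.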

Thus I believe this action should count as done. However, I wanted to bring it to the UTC's attention in case there is disagreement.

(Note: I have a draft update of #18, but it currently only has minor editing fixes for items noted by Julie and Asmus, so I wouldn't recommend issuing a new version until there is something more substantial.)