RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 20:50:23 CDT

Next message: Philippe Verdy: "RE: Normalization in panlingual application"

Previous message: Philippe Verdy: "RE: Normalization in panlingual application"
In reply to: Rick McGowan: "New Public Review Issue: Proposed Update UTS #18"
Next in thread: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"

Some uncorrected items in RL1.6 are still not clear enough:

[quote]
Logical beginning of line (often "^")
(...)
* There is no empty line within the sequence \u000D\u000A.
[/quote]

[quote]
Logical end of line (often "$")
(...)
* There is no empty line within the sequence \u000D\u000A.
[/quote]

The repeated item suggests something wrong, because, even in
multiline mode, a file (or complete text) that contains only CRLF would
still have one empty line, with a start of line and an end of line:
* when not in multiline mode, they are at the same position, just before the
CRLF sequence, which is not considered a single character (matched by ".")
but a line separator.
* in multiline mode, there's a start of line before the CRLF sequence and an
end of line after the sequence.

I think that what is really intended here in these two items is:
* There is no empty line in the middle of the sequence \u000D\u000A, i.e.
between the first and second character.

When NOT in multiline mode, EVERY occurrence of a CRLF sequence implies an
end of line before the sequence and a start of line after the sequence,
if the sequence is not at the end of the file/text.
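
As an illustration only, the boundary rule described above can be sketched
in Python. The function name line_spans and the separator set (assumed here
to be the line separators listed in RL1.6, with CRLF tried first so that it
is consumed as one unit) are assumptions of this sketch, not part of the
proposed text. It follows the reading given above: a separator at the very
end of the text does not open a further line, and a text with no separator
at all still has exactly one line.

```python
import re

# Assumed separator set; CRLF is listed first so it is matched as one unit
# and no boundary can fall between CR and LF.
LINE_SEP = re.compile(r"\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def line_spans(text):
    """Return (start, end) offsets of each logical line, separators excluded."""
    spans = []
    start = 0
    for sep in LINE_SEP.finditer(text):
        spans.append((start, sep.start()))  # end of line just before the separator
        start = sep.end()                   # candidate start of line just after it
    # A separator at the end of the text does not open another line,
    # but a text without any separator (even an empty one) has one line.
    if start < len(text) or not spans:
        spans.append((start, len(text)))
    return spans

print(line_spans("\r\n"))     # one empty line, located before the CRLF
print(line_spans("a\r\nb"))   # no span starts or ends between CR and LF
print(line_spans("a\n\rb"))   # LF followed by CR is two separators
```

With this sketch, a text consisting only of CRLF yields a single empty line
at offset 0, and no reported offset ever falls between the CR and the LF.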

When in multiline mode, a CRLF sequence is treated as if it were a single
character, but this does not invalidate the existence of exactly one start
of line (just before the first character of the text, even if this one is
CR, part of a CRLF sequence) and exactly one end of line (just after the
last character of the text, even if this one is LF, part of a CRLF sequence).

Even for a completely empty file, in multiline mode, the file contains a
start of line and an end of line at the same position (the "start of line"
and "end of line" in multiline mode actually mean "start of file" and "end
of file"). This means that a search regexp pattern like "^$" in multiline
mode will find a single match for empty files, and each of the two regexps
"^" or "$" will find a single match in EVERY file (whether or not it
contains any characters), possibly at different positions, depending on the
content.
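
The claim can be checked with a small Python sketch. This is only an
approximation under stated assumptions: Python's re.MULTILINE flag may not
correspond to what is called "multiline mode" in this message, so the
sketch uses the whole-text anchors \A and \Z as stand-ins for "start of
file" and "end of file". (Python's plain "$" also matches just before a
trailing newline, which is why \Z is used instead.) The helper name
match_positions is hypothetical.

```python
import re

def match_positions(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

samples = ["", "abc", "\r\n", "a\r\nb\r\n"]

for text in samples:
    starts = match_positions(r"\A", text)   # stand-in for "^" (start of file)
    ends = match_positions(r"\Z", text)     # stand-in for "$" (end of file)
    both = match_positions(r"\A\Z", text)   # stand-in for "^$"
    print(repr(text), starts, ends, both)
```

Each anchor matches exactly once in every sample, at offset 0 and at the
length of the text respectively, and the combined pattern matches only the
empty text.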

Philippe.

Rick McGowan wrote:
> The Unicode Technical Committee has posted a new issue for public review
> and comment. Details are on the following web page:
> http://www.unicode.org/review/
> Review periods for the new items close on October 10, 2007.
>
> Please see the page for links to discussion and relevant documents.
> Briefly, the new issue is:
>
> Issue #111 Proposed Update UAX #18: Unicode Regular Expressions
> http://www.unicode.org/reports/tr18/tr18-12.html
> This proposed update clarifies conformance requirements for "." and CRLF.
> Public feedback is invited.
>
> If you have comments for official UTC consideration, please post them by
> submitting your comments through our feedback & reporting page:
> http://www.unicode.org/reporting.html

This message was sent to the reporting form.


This archive was generated by hypermail 2.1.5 : Wed Sep 19 2007 - 20:52:13 CDT