RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 20:50:23 CDT

    Some uncorrected items in RL1.6 are still not clear enough:

    [quote]
    Logical beginning of line (often "^")
    (...)
    * There is no empty line within the sequence \u000D\u000A.
    [/quote]

    [quote]
    Logical end of line (often "$")
    (...)
    * There is no empty line within the sequence \u000D\u000A.
    [/quote]

    The repeated item suggests something wrong because, even in multiline
    mode, a file (or complete text) that contains only CRLF would still have
    one empty line, with a start of line and an end of line:
    * when not in multiline mode, they are at the same position, just before the
    CRLF sequence, which is not considered a single character (matched by ".")
    but a line separator.
    * in multiline mode, there is a start of line before the CRLF sequence and an
    end of line after the sequence.

    I think that what is really intended here in these two items is:
    * There is no empty line in the middle of the sequence \u000D\u000A, i.e.
    between the first and second character.

    When NOT in multiline mode, EVERY occurrence of a CRLF sequence implies an
    end of line before the sequence and, if the sequence is not at the end of
    the file/text, a start of line after the sequence.
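
    To make the proposed wording concrete, here is a minimal Python sketch of
    where line starts and ends would fall when "^" and "$" match at every
    line (what this message calls NOT being in multiline mode). The helper
    name line_boundaries and the separator set are assumptions made for
    illustration only; they are not taken from UTS #18 or from any regex
    library.

```python
# Hypothetical helper, for illustration only.
# Assumed single-character line separators; CRLF is handled separately.
SEPARATORS = {"\n", "\r", "\u0085", "\u2028", "\u2029"}


def line_boundaries(text):
    """Return (starts, ends): offsets of logical line starts and line ends."""
    starts, ends = [0], []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            width = 2  # CRLF is one separator: no boundary between CR and LF
        elif text[i] in SEPARATORS:
            width = 1
        else:
            i += 1
            continue
        ends.append(i)  # end of line just before the separator
        i += width
        if i < len(text):
            starts.append(i)  # start of line after it, unless at end of text
    if len(starts) > len(ends):
        ends.append(len(text))  # last line is not closed by a separator
    return starts, ends


for sample in ["\r\n", "a\r\nb", "a\n\nb"]:
    print(repr(sample), line_boundaries(sample))
```

    For a text consisting only of CRLF, this sketch reports one start of line
    and one end of line, both at offset 0, and nothing at offset 1 (between
    CR and LF).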

    When in multiline mode, CRLF sequences are treated as if they were a single
    character, but this does not invalidate the existence of exactly one start
    of line (just before the first character of the text, even if this one is
    CR, part of a CRLF sequence) and exactly one end of line (just after the
    last character of the text, even if this one is LF, part of a CRLF sequence).

    Even for a completely empty file, in multiline mode, the file contains a
    start of line and an end of line at the same position (the "start of line"
    and "end of line" in multiline mode actually mean "start of file" and "end
    of file"). This means that a search regexp pattern like "^$" in multiline
    mode will find a single match for empty files, and each of the two regexps
    "^" and "$" will find a single match in EVERY file (whether or not it
    contains any characters), possibly at different positions, depending on the
    content.
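
    The match counts above can be checked with a short Python sketch. It uses
    the anchors \A and \Z of Python's re module, which always mean start of
    text and end of text, as stand-ins for "^" and "$" in what this message
    calls multiline mode. That mapping is an assumption of the sketch: Python's
    own "^", "$" and its re.MULTILINE flag follow a different convention, and
    count_matches is a hypothetical helper name.

```python
import re


def count_matches(pattern, text):
    """Count the non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


# \A and \Z stand in for "^" and "$" anchored to the whole text (assumption).
for text in ["", "\r\n", "abc", "\r\nabc\r\n"]:
    print(repr(text),
          count_matches(r"\A", text),    # start of text
          count_matches(r"\Z", text),    # end of text
          count_matches(r"\A\Z", text))  # both at the same position
```

    Each anchor alone matches exactly once in every sample, and the combined
    pattern matches only the empty text.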

    Philippe.

    Rick McGowan wrote:
    > The Unicode Technical Committee has posted a new issue for public review
    > and comment. Details are on the following web page:
    > http://www.unicode.org/review/
    > Review periods for the new items close on October 10, 2007.
    >
    > Please see the page for links to discussion and relevant documents.
    > Briefly, the new issue is:
    >
    > Issue #111 Proposed Update UAX #18: Unicode Regular Expressions
    > http://www.unicode.org/reports/tr18/tr18-12.html
    > This proposed update clarifies conformance requirements for "." and CRLF.
    > Public feedback is invited.
    >
    > If you have comments for official UTC consideration, please post them by
    > submitting your comments through our feedback & reporting page:
    > http://www.unicode.org/reporting.html

    This message was sent to the reporting form.


