Re: New Public Review Issue: Proposed Update UTS #18

From: Andy Heninger (
Date: Fri Sep 21 2007 - 15:29:29 CDT

    On 9/20/07, Mike <> wrote:
    > > Issue #111 Proposed Update UAX #18: Unicode Regular Expressions
    > >
    > >
    > >
    > > This proposed update clarifies conformance requirements for "." and
    > > CRLF.
    > > Public feedback is invited.
    > I disagree with the MUSTs in the proposed text. In my implementation,
    > whether "." matches newline sequences is independent of "multiline
    > mode." Multiline mode affects the behavior of ^ and $, not .; in
    > single line mode, they match only at the beginning or end of the text
    > (or just before a final newline sequence); in multiline mode, ^ matches
    > at the beginning of the string or after any newline sequence, and $
    > matches before any newline sequence or at the end of the string.

    This is my understanding also. Multiline mode only affects the behavior of
    ^ and $, and does not control whether "." matches a newline sequence.

    The separate option "DotAll" (Java terminology), or "Single Line Mode"
    (classic regex terminology), controls whether "." matches a newline
    sequence or not.
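
A minimal sketch of that independence, using Python's re module purely as an illustration (it is not one of the implementations discussed in this thread). re.MULTILINE plays the role of multiline mode and re.DOTALL the role of DotAll; the sample text and variable names are made up for the example.

```python
import re

text = "one\ntwo"
pattern = r"^.+$"

# Multiline only: ^ and $ match at each line boundary, "." still stops at "\n".
multiline_only = re.findall(pattern, text, re.MULTILINE)

# DotAll only: "." crosses "\n", but ^ and $ stay anchored to the whole text.
dotall_only = re.findall(pattern, text, re.DOTALL)

# Neither: "." cannot cross "\n" and $ cannot match before a non-final "\n".
neither = re.findall(pattern, text)

# Both: the two options combine without interfering with each other.
both = re.findall(pattern, text, re.MULTILINE | re.DOTALL)

print(multiline_only, dotall_only, neither, both)
```

Each flag changes only its own operators, which is why describing them as a single mode would be misleading.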

    I think it might be best if, to the extent that we can, we avoid
    descriptions and listings of specific regexp modes, and instead say that any
    operations that are sensitive to newlines must recognize all of the Unicode
    line-ending characters and sequences. The idea is to avoid any implication
    that a list of regexp operations, modes or tests that we include is
    complete, and to avoid having to describe too many things that don't
    pertain directly to Unicode.
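
As a rough illustration of what "recognize all of the Unicode line-ending characters and sequences" could look like, here is a sketch in Python. The set of terminators (LF, VT, FF, CR, NEL, LS, PS, plus the CRLF sequence) is my reading of the UTS #18 line-boundary list and should be checked against the current text; the names NEWLINE_CHARS, ANY_NEWLINE and NON_NEWLINE are hypothetical helpers, not part of any library.

```python
import re

# Assumed terminator set: LF, VT, FF, CR, NEL (U+0085), LS (U+2028), PS (U+2029).
NEWLINE_CHARS = "\n\x0b\x0c\r\x85\u2028\u2029"

# CRLF is listed first so that it is consumed as one line ending, not two.
ANY_NEWLINE = re.compile("\r\n|[" + NEWLINE_CHARS + "]")

# A newline-aware stand-in for ".": any single character that is not a terminator.
NON_NEWLINE = "[^" + NEWLINE_CHARS + "]"

text = "ab\r\ncd\u2028ef\x85gh"

# Splitting on the full set yields four lines.
lines = ANY_NEWLINE.split(text)

# Python's built-in "." treats only "\n" as a newline, so it runs through "\r".
builtin_dot = re.match(r".+", text).group()

# The newline-aware class stops before the CR of the CRLF pair.
unicode_dot = re.match(NON_NEWLINE + "+", text).group()

print(lines, repr(builtin_dot), repr(unicode_dot))
```

The same terminator set would feed ^, $, and "." alike, so the requirement can be stated once rather than per mode.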

    > You can turn on the DotMatchesNewline and MultilineMatching options
    > separately. As a side note, I implemented "." to match a default
    > grapheme cluster, so A + ACUTE is treated as a single entity, and
    > Hangul syllables are kept together (you can also specify them using
    > \L+\V+\T* if you want).

    I've been contemplating doing something along these lines also, but for more
    than just ".", and for somewhat different reasons. Making the fundamental
    unit of matching a Grapheme Cluster, so that a plain "A" would not match
    the "A" in "A + ACUTE", would be a clean way to define a canonically
    equivalent match. It is not too hard to explain, the results are completely
    independent of normalization form, and match boundaries would never fall
    inside a composed character. Clusters would hang together in the pattern
    also, so that quantifiers (*, +, ?, etc.) would apply to the entire
    preceding cluster, not to the preceding code point.
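
A very simplified sketch of cluster-based matching, in Python. It approximates a cluster as a base code point plus any following code points with a non-zero canonical combining class, after NFD normalization. That is an assumption made for brevity and not the default grapheme cluster definition (it does not keep Hangul syllables together, for example). The functions clusters and find_cluster are hypothetical names for this illustration.

```python
import unicodedata

def clusters(text):
    """Split text into rough clusters: a base plus its trailing combining marks."""
    out = []
    # NFD puts marks in canonical order, so equivalent inputs yield equal clusters.
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

def find_cluster(needle, haystack):
    """Return the cluster index of the first whole-cluster match, or -1."""
    n, h = clusters(needle), clusters(haystack)
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1

# A plain "A" does not match the "A" inside "A + ACUTE".
print(find_cluster("A", "A\u0301B"))
# Precomposed and decomposed forms match each other.
print(find_cluster("\u00c1", "xA\u0301"))
```

Because comparison happens cluster by cluster on normalized text, a match can never begin or end in the middle of a base-plus-mark sequence.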


    Regarding the question of how to complement a [^set] that contains strings,
    or grapheme clusters, or collation elements, or whatever we want to call
    them, I am still struggling with what it means, and what makes sense. I'm
    not sure the concept makes complete sense unless there is some interesting,
    not too big "Universe" from which the original strings could be removed.
    Maybe something language or locale sensitive, although in general I dislike
    the idea of making matching be sensitive to such things.

      -- Andy

