RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 14:53:51 CDT

    Jonathan Coxhead wrote:
    > I'd just like to point out that a "[ ]" regular expression is defined to
    > match always exactly one character (if it matches at all).

    Why? This is just a historic limitation of old ASCII-based
    implementations. From a user's perspective, the [] notation is just a
    convenient short way to write an alternation between multiple strings, each
    making up what the user MAY perceive as a single character. If you want to
    be fair to every language, you have to admit that restricting [] to
    single-codepoint matches is not relevant.

    The fact that [] is more efficient in regexp engines than a notation using
    (...|...|...) is just a matter of implementation. My opinion is that such a
    performance difference is a defect of the implementation, i.e. a bug. From
    the user's perspective, the meaning is not altered.
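
A minimal Python sketch of the equivalence claimed above: a bracket set and an alternation over the same members find the same matches. The helper name as_alternation and the sample string are assumptions made for illustration; they come from neither the thread nor UTS #18.

```python
import re

def as_alternation(members):
    """Spell a set of strings as a non-capturing alternation, longest first."""
    ordered = sorted(members, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(m) for m in ordered) + ")"

sample = "banana split"

# The bracket set and the generated alternation over the same three members.
class_hits = re.findall(r"[abn]", sample)
alt_hits = re.findall(as_alternation(["a", "b", "n"]), sample)
```

Unlike the bracket set in Python's re module, the alternation form also accepts members longer than one code point, which is the extension argued for here.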

    Then there's the problem of regexps like [] : the set contains composite
    characters. For such a shortcut to be acceptable, it has to remain
    meaningful even if the regexp is in NFD form, without having to write it
    explicitly as [\q{}\q{}] (otherwise the set would also include [ae] without
    the accents, and would include the accents separately as well).

    So, in a shortcut notation like [], you need an additional rule to
    disambiguate the meaning: you need to parse the set using default grapheme
    cluster boundaries, so that the characters considered as unbreakable units
    are the combining sequences (and all their canonical equivalents). So it's
    up to the implementation to make sure that [] is effectively a shortcut
    completely equivalent to (|). If only the precomposed characters must be
    matched (and not the canonically equivalent decomposed strings), then you
    need to specify the regexp in a way that can't be interpreted as canonically
    equivalent.
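
The NFD problem described above can be reproduced with Python's standard re and unicodedata modules. This is only a sketch under stated assumptions: the accented letters of the original message did not survive in this archive, so U+00E0 and U+00E9 are chosen arbitrarily here, and canonical_set is a hypothetical helper, not an API of any regex engine.

```python
import re
import unicodedata

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def canonical_set(members):
    """Match any member as a whole unit, comparing in NFD (hypothetical helper)."""
    forms = sorted({nfd(m) for m in members}, key=len, reverse=True)
    return re.compile("(?:" + "|".join(re.escape(f) for f in forms) + ")")

# Assumed set members: a-grave and e-acute.
members = ["\u00e0", "\u00e9"]

# Naive approach: put the NFD code points straight into a bracket set.
# The set then contains a, U+0300, e and U+0301 as four separate members.
naive = re.compile("[" + nfd("".join(members)) + "]")

# Subject text in NFD: an a-grave followed by a plain, unaccented e.
subject = nfd("\u00e0e")

naive_hits = naive.findall(subject)
unit_hits = canonical_set(members).findall(subject)
```

The naive set matches the bare letters and the lone accent, while the alternation only matches the complete combining sequence. The sketch does not check grapheme cluster boundaries, so a base letter carrying additional combining marks would still be matched as a prefix.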

    If a regexp string contains "", it designates all its canonical
    equivalents; to match only the precomposed "", you would need a notation
    specifying that, like "\C{}" for matching the character converted to NFC
    form only, excluding all other canonical equivalents. But then, how to match
    characters that are excluded from recomposition in NFC form?
    * Maybe this notation should still allow the recompositions (so that
    compatibility characters become matchable)
    * Or, more safely, use another way to specify these compatibility
    characters (like the \uxxxx notation, which can't be interpreted as meaning
    anything other than the designated character).
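
The difficulty with composition exclusions can be checked with Python's unicodedata module. The sketch below uses U+0958 (DEVANAGARI LETTER QA), which is listed among the Unicode composition exclusions, as an assumed example; the message itself does not name a specific character, and is_nfc is a helper introduced only for illustration.

```python
import unicodedata

def is_nfc(s):
    """True if s is already in NFC form."""
    return unicodedata.normalize("NFC", s) == s

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
excluded = "\u0958"      # DEVANAGARI LETTER QA, excluded from composition

# NFC recomposes the e + accent pair, but never produces U+0958:
# it leaves the decomposed pair U+0915 U+093C instead.
recomposed = unicodedata.normalize("NFC", decomposed)
excluded_nfc = unicodedata.normalize("NFC", excluded)
```

A notation that matches "the NFC form only" would therefore never match U+0958 itself, which is why a code-point escape such as \uxxxx looks like the safer way to designate such characters.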

    Another problem: what is the meaning of [a-e]? From a language-dependent
    perspective, it should match all characters between a and e in the
    language's alphabet. This means that it should match not only single
    graphemes, but also the possible digraphs (like "ch" or "c'h"), i.e. the
    collation elements.
    But I think that regexps should not be interpreted ambiguously, unless the
    application knows which locale the user expects by default. Another
    mechanism should be available in regexps to override the default locale.

    Two possibilities:
    * introduce locale specifiers in external regexp flags (remember the flags
    in Perl or PHP or vi/ed/sed after the final slash delimiting the regexp)
    * include in the regexp syntax itself a locale specifier for specific parts
    of the regexps like:
       (?locale=br![a-e]) which means that the set is interpreted within the
    Breton locale where (ch) and (c'h) are part of the alphabet, between (b) and
    (d), but NOT (c): an isolated c would NOT be matched in this locale, unless
    you use an extended locale that also includes (c) within the Breton alphabet.
    To specify the historic behaviour, you would simply use (?locale=C!...) or
    (?locale=POSIX!...) for example to ignore the user's default locale.
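
What a locale-aware [a-e] could expand to can be sketched in Python. Everything below is an assumption for illustration: the (?locale=...) syntax proposed above is not implemented by Python's re module, BRETON_PREFIX hard-codes only the first letters of the Breton alphabet (taking the second digraph in the text to be c'h), and expand_range is a hypothetical helper. A real implementation would take its data from a locale's collation rules.

```python
import re

# Assumed start of the Breton alphabet; note that a bare "c" is not a letter.
BRETON_PREFIX = ["a", "b", "ch", "c'h", "d", "e", "f"]

def expand_range(alphabet, first, last):
    """Expand first..last in the given alphabet order into an alternation."""
    lo, hi = alphabet.index(first), alphabet.index(last)
    # Longest members first, so that c'h and ch are tried before single letters.
    members = sorted(alphabet[lo:hi + 1], key=len, reverse=True)
    return re.compile("(?:" + "|".join(re.escape(m) for m in members) + ")")

# Stand-in for the proposed (?locale=br![a-e]).
breton_a_to_e = expand_range(BRETON_PREFIX, "a", "e")
```

With this expansion, breton_a_to_e.fullmatch("ch") and breton_a_to_e.fullmatch("c'h") succeed, while an isolated "c" and the out-of-range "f" do not match.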


