Re: New Public Review Issue: Proposed Update UTS #18

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Oct 02 2007 - 11:07:37 CST

    On 10/1/2007 9:57 AM, Mike wrote:
    >>> I think it's a bad idea for \q to have the side
    >>> effect of changing the meaning of ".".
    >>
    >> Well if you don't do that, then [^set\q{ch}] becomes inconsistent and
    >> does not return the user-expected result, i.e. the exact complement of
    >> what [set\q{ch}] matches, according to ".".
    >
    > No, there is no inconsistency. When my compiler encounters a
    > character class, it creates a new matcher object for it; it
    > doesn't use the "." matcher (a predefined object).
    >
    >> [...] as soon as you are introducing collation elements
    >> in regexps, these are sorted by collation, and collations are
    >> locale-sensitive...
    >
    > I don't see why they need to be sorted. All that matters is
    > that you find the longest match. [a-z\q{ch}] will match "ch"
    > in "chinchilla" rather than just "c".
    >
    >> In addition, the meaning of ranges in sets like [a-z] should also be
    >> consistent with the collation used...
    >
    > I disagree with this. I think that having [a-z] magically
    > mean all characters in a particular language is asking for
    > trouble. In French, would you say that [a-z] should match
    > C WITH CEDILLA or A + ACUTE?

    Having that kind of support allows regexes to be written that match,
    say, the top half of a list by using [a-k] etc. That's something that
    you can do in English today, but not in any other language. You need to
    decide whether extending regexes to other languages should allow such
    uses (in which case you think of collation elements and sorting order)
    or not.

    Depending on how many accented letters a language uses, writing the
    equivalent expression manually can be both tedious and error-prone.

    BTW, in Swedish, for example, [a-z] would not match all letters, since a
    with ring, a with dieresis and o with dieresis would sort after z. So,
    it's not a question of making [a-z] magic, but whether the elements in
    [ ] are character codes or collation elements.
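
    A minimal sketch of that difference, assuming a simplified lowercase
    Swedish alphabet in which a with ring, a with dieresis and o with
    dieresis follow z. The alphabet string and the collation_range helper
    are hypothetical, written only for this illustration; a real
    implementation would take the order from a collation library.

```python
import re

# Assumed, simplified Swedish alphabet order (lowercase only):
# a..z followed by a-ring (U+00E5), a-dieresis (U+00E4), o-dieresis (U+00F6).
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def collation_range(first, last, alphabet=SWEDISH_ALPHABET):
    """Hypothetical helper: letters from first to last in alphabet order."""
    start = alphabet.index(first)
    end = alphabet.index(last)
    return alphabet[start:end + 1]


words = ["apa", "zebra", "\u00e5ka", "\u00e4ra", "\u00f6l"]

# Elements of [ ] as character codes: U+0061..U+007A only, so words
# starting with the three extra letters are not matched.
by_code_point = [w for w in words if re.match(r"[a-z]", w)]

# Elements of [ ] as positions in the assumed alphabet: the range from
# "a" to o-dieresis is expanded into an explicit set before compiling.
letters = collation_range("a", "\u00f6")
by_alphabet = [w for w in words if re.match("[" + re.escape(letters) + "]", w)]
```

    With character codes, only "apa" and "zebra" are matched; with the
    assumed alphabet order, all five words are.
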
    >
    > It's my opinion that ranges inside [] should be simple binary
    > order. If you want to do anything fancier, there should be
    > new syntax for it.
    That, or an option?

    Now, other than for canonical decompositions (and conjoining Jamo), I've
    not seen an example that informs me of why it is useful for a regex
    package to be able to match 'ch' as if it were a single code point. Can
    somebody please present a simple example that shows an important use
    case that can't be realized if regexes are limited to a single character
    (plus *canonical* equivalents)?

    After all, the atomic elements for writing would be the 'c' and 'h'; it
    is only for the purpose of some other text operations that 'ch' is
    (sometimes) considered a unit.
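
    For reference, the longest-match behavior quoted above for
    [a-z\q{ch}] can be approximated in an engine without \q by an
    alternation that lists the two-letter element first. This is only a
    sketch using Python's re module, whose alternation tries branches left
    to right; the names element and elements are made up for the example,
    and it says nothing about how a \q-aware engine would be implemented.

```python
import re

# Emulates [a-z\q{ch}]: the two-letter element "ch" is tried before
# any single letter, so it wins wherever both could match.
element = re.compile(r"ch|[a-z]")


def elements(text):
    """Split text into successive matches of the emulated set."""
    return element.findall(text)


first = element.match("chinchilla").group()   # "ch", not just "c"
pieces = elements("chinchilla")
# pieces == ["ch", "i", "n", "ch", "i", "l", "l", "a"]
```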


