RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 01 2007 - 09:40:41 CST

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Mike
    > Sent: Monday, October 1, 2007 15:59
    > To: 'Unicode'
    > Subject: Re: New Public Review Issue: Proposed Update UTS #18
    >
    > > But note that with my notation /\q{ch}./ would NOT be equivalent to
    > > /ch./
    > > - the latter regexp will match only 3 characters: /c/ followed by /h/
    > > followed by what /./ matches by default (i.e. [\u0000-\u10FFFF] minus
    > > the set of line terminators, which depends on the single-line or
    > > multi-line mode in effect, and that I'll note \R).
    > > - the former regexp extends the input universe (matched by ".") by
    > > making it [\u0000-\u10FFFF\q{ch}] (so that it now contains /c/ or /h/
    > > or the sequence /ch/).
    >
    > I'll say it again. I think it's a bad idea for \q to have the side
    > effect of changing the meaning of ".".

    Well, if you don't do that, then [^set\q{ch}] becomes inconsistent and does
    not return the result the user expects, i.e. the exact complement of what
    [set\q{ch}] matches, relative to what "." matches.
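
    The complement argument can be checked with a small sketch. It assumes
    Python's re module, a hypothetical set {a, b} plus the string "ch", and
    emulates \q{ch} with plain alternation, since the notation discussed here
    is a proposal rather than an implemented syntax. All names in the code are
    illustrative only.

```python
import re

# Hypothetical model of the set [ab\q{ch}]: the letters a, b and the string "ch".
ELEMENT = r"(?:ch|.)"                    # "." extended so that "ch" counts as one element
POSITIVE = r"(?:ch|[ab])"                # models [ab\q{ch}]
NEGATIVE = rf"(?!{POSITIVE}){ELEMENT}"   # models [^ab\q{ch}] over the extended universe
NAIVE_NEGATIVE = r"[^ab]"                # complement taken over single code points only


def first_element(pattern, text):
    """Return what the pattern matches at the start of text, or None."""
    m = re.match(pattern, text, re.DOTALL)
    return m.group(0) if m else None


for text in ["chz", "cz", "az", "xz"]:
    print(
        text,
        first_element(POSITIVE, text),
        first_element(NEGATIVE, text),
        first_element(NAIVE_NEGATIVE, text),
    )
```

    With the extended universe, exactly one of POSITIVE and NEGATIVE matches at
    the start of each input. The naive complement [^ab] also matches the "c" of
    "chz", where the positive set already matches "ch", so the two are no
    longer complements of each other.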

    > > For example, to match all 3-letter words in Spanish between c and d
    > > (inclusive, but "c" and "d" won't match because they are not 3 letters)
    > > one would use /(?locale=es:(?range:c:d:...))/
    >
    > This seems to be way beyond what I think regular expressions are for.
    > Maybe you should create a little text matching language....

    I did propose it, because as soon as you introduce collation elements
    in regexps, these are sorted by collation, and collations are
    locale-sensitive...

    In addition, the meaning of ranges in sets like [a-z] should also be
    consistent with the collation used...

    Now if a regexp is locale-sensitive (due to collation), there should be a
    way to create a regexp that is not, i.e. to specify explicitly the locale in
    use.

    If you don't want locale-sensitive regexps, you have to specify that your
    implementation supports only one locale, and the simplest locale is the "C"
    locale with binary order of every collation element!
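
    To illustrate how a range changes meaning with the collation, here is a
    minimal sketch in Python. It assumes a hand-written, lowercase-only table
    for the traditional Spanish alphabet (where "ch" and "ll" are single
    collation elements) and compares it with binary code point order as in the
    "C" locale. The names TRADITIONAL_ES, es_key, binary_key and in_range are
    hypothetical and not part of any library or of UTS #18.

```python
# Traditional Spanish alphabet, with the digraphs "ch" and "ll" as single
# collation elements (assumption: lowercase input only, no accents handled).
TRADITIONAL_ES = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {element: i for i, element in enumerate(TRADITIONAL_ES)}


def es_key(word):
    """Collation key: greedy split into collation elements, then their ranks."""
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:   # digraph such as "ch" or "ll"
            key.append(RANK[pair])
            i += 2
        else:                                 # single letter, unknown ones sort last
            key.append(RANK.get(word[i], len(RANK) + ord(word[i])))
            i += 1
    return key


def binary_key(word):
    """Binary order of code points, as in the "C" locale."""
    return [ord(c) for c in word]


def in_range(word, lo, hi, key):
    """True if word sorts between lo and hi (inclusive) under the given key."""
    return key(lo) <= key(word) <= key(hi)


words = ["cosa", "chico", "dama", "cuna"]
print(sorted(words, key=binary_key))   # binary order
print(sorted(words, key=es_key))       # traditional Spanish order
print(in_range("chico", "c", "cz", binary_key), in_range("chico", "c", "cz", es_key))
```

    Under binary order "chico" falls inside the range from "c" to "cz"; under
    the traditional Spanish table it does not, because "ch" sorts after every
    word starting with a plain "c". The same pattern therefore selects
    different words unless the locale is fixed explicitly.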

    All this is needed for consistency!


