Re: New Public Review Issue: Proposed Update UTS #18

From: Asmus Freytag
Date: Tue Oct 02 2007 - 13:04:14 CST

    On 10/2/2007 10:59 AM, Michael Maxwell wrote:
    > I hesitate to jump into this thread, but:
    > Asmus Freytag wrote:
    >> Depending on how many accented letters a language uses,
    >> writing the equivalent expression manually can be both
    >> tedious and error-prone.
    > Aren't there two issues here that need to be separated:
    > (1) the issue of what some regex *means*, e.g. what ^X means, where X is some regex.
    > (2) the question of how easy it is to enter X on a computer.

    In ASCII/English these two issues are inextricably tied together, so you
    can't always get good guidance on the correct (expected) way to extend
    regex notation to other character sets, scripts, or languages.

    Does ^[a-k] mean "search for terms with initial a,b,c,d,e,f,g,h,i,j,k"
    or does it mean, "search for any term where the initial falls between
    'a' and 'k' inclusive"?

    As long as you *strictly* match by code points, the former
    interpretation is clearly preferred. But the minute you start treating A
    WITH RING and A + COMBINING RING ABOVE as equivalent, this becomes less
    clear.
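
    To make the code point case concrete, here is a minimal sketch, assuming
    Python's standard re module, which matches strictly by code point. The
    sample word and the helper name starts_in_range are illustrative only.
    The same text gives different answers depending on whether the initial
    is precomposed or decomposed:

```python
import re
import unicodedata

# ^[a-k] matched strictly by code point (the behaviour of Python's re module).
pattern = re.compile(r"^[a-k]")

# Same word in two canonically equivalent encodings.
precomposed = "\u00e5ngstr\u00f6m"                       # starts with U+00E5 (a with ring)
decomposed = unicodedata.normalize("NFD", precomposed)  # starts with U+0061 U+030A


def starts_in_range(text: str) -> bool:
    """Return True if the first code point of text is in a..k."""
    return pattern.match(text) is not None


results = {
    "precomposed": starts_in_range(precomposed),  # U+00E5 lies outside a..k
    "decomposed": starts_in_range(decomposed),    # the bare 'a' lies inside a..k
}
print(results)
```

    A matcher that treats the two spellings as equivalent has to pick one
    answer for both, and that is where the simple enumeration reading of
    [a-k] stops being obviously right.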

    And if you throw in the ability to specify collation elements inside the
    [ ], then you've left behind the assumption that what you are matching
    is strings of character codes and entered the realm where what you are
    matching is strings of grapheme clusters, or collation elements.

    What I'm trying to point out is that you can define regex notations for
    both, but you should probably be consistent and not mix the models.


    > I would hate to make the meaning of some regex counter-intuitive just because it's hard to type with today's software.
    I don't think I was advocating that.

    This archive was generated by hypermail 2.1.5 : Tue Oct 02 2007 - 13:07:02 CST