Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (mike-list@pobox.com)
Date: Sun Sep 30 2007 - 16:34:48 CST

    > The fact that [] is more efficient in regexp engines than a notation using
    > (...|...|...) is just a matter of implementation. My opinion is that such
    > performance difference is a defect of the implementation, i.e. a bug.

    The term "bug" should refer to a situation where the wrong result
    is obtained. Software should be -correct- first and -fast- second.
    If it's fast enough, there is always something else more important
    to spend time on.

    > So, in a shortcut notation like [], you need an additional rule to
    > disambiguate the meaning: you need to parse the set using default grapheme
    > cluster boundaries, so that the characters considered as unbreakable units
    > are the combining sequences (and all their canonical equivalents). So it's up
    > to the implementation to make sure that [] is effectively a shortcut
    > completely equivalent to (|).

    I'm not sure I agree that you want to look for default grapheme
    cluster boundaries inside a character class. If you list a few
    Hangul L jamos, they will all be jumbled together into a single
    cluster, for example. Also, how would you interpret [a\u0300]?
    As (a|\u0300) or (a\u0300)?
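
    As a rough illustration of the first reading, here is a minimal sketch
    using Python's re module, chosen only as a convenient stand-in for a
    code-point-based engine (it is not something UTS #18 prescribes). The
    variable names are made up for the example. Under this engine the
    class is a set of single code points, i.e. (a|\u0300):

```python
import re
import unicodedata

# In Python's re, a character class is a set of code points, so
# [a\u0300] behaves like (a|\u0300), not like the sequence a + U+0300.
cls = re.compile("[a\u0300]")

decomposed = "a\u0300"  # 'a' followed by COMBINING GRAVE ACCENT
precomposed = unicodedata.normalize("NFC", decomposed)  # U+00E0

# Scanning the decomposed string yields two separate one-code-point matches.
parts = cls.findall(decomposed)

# The class consumes exactly one code point, so it cannot match the
# two-code-point string as a whole.
whole = cls.fullmatch(decomposed)

# The canonically equivalent precomposed form is not in the set at all.
composed_hit = cls.search(precomposed)

print(parts, whole, composed_hit)
```

    An engine that parsed the class by grapheme clusters would instead have
    to treat a + U+0300 as one unit and also accept U+00E0, which is the
    ambiguity raised above.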

    > Another problem: what is the meaning of [a-e]? In a language-dependent
    > perspective, it should match all characters between a and e in the
    > language's alphabet. This means that it should match not only single
    > graphemes, but also the possible digraphs (like "ch" or "ch"), i.e. the
    > collation elements.

    I think that [a-e] should -always- mean the five code points,
    U+0061 through U+0065, regardless of locale. Even if you specify
    a locale: \l{es}[a-e], I think it should still mean the same five
    code points, and not add other elements such as "ch" (even though
    it is treated as a single character in the Spanish locale). In
    Hawaiian, [a-z] would mean [aeiouhklmnpw], which would certainly
    cause trouble.

    I can see that it might be useful to be able to do this, but I
    would suggest that new syntax should be used to avoid confusion.
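
    For reference, the code-point reading of [a-e] argued for above is what
    a locale-unaware engine already does. A small sketch, again assuming
    Python's re module purely as an example engine (the names are
    illustrative only):

```python
import re

rng = re.compile("[a-e]")

# Collect every code point below U+0100 that the class accepts.
matched = [chr(cp) for cp in range(0x100) if rng.fullmatch(chr(cp))]

# The digraph "ch" is two code points, so the class cannot match it
# as a single unit.
ch_unit = rng.fullmatch("ch")

# Scanning "ch" finds only the "c"; "h" lies outside U+0061..U+0065.
ch_parts = rng.findall("ch")

print(matched, ch_unit, ch_parts)
```

    Matching "ch" as one element of a range would need collation data for
    the locale, which is why a separate syntax seems the safer route.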

    Mike


