Re: New Public Review Issue: Proposed Update UTS #18

From: Mike (
Date: Tue Sep 25 2007 - 13:38:48 CDT

    > Having named character sequences in \N is an interesting idea. Would you
    > mind proposing that to the UTC using the online form? (That's the way to
    > raise issues to the UTC's attention.)


    > BTW, Andy and I concluded that the really effective way to do canonical
    > equivalence in regex would be in a mode where grapheme cluster is the
    > unit, not code point.

    I'm starting to think that we may need to support both modes.
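
    To make the difference between the two modes concrete, here is a
    minimal Python sketch. It is an illustration only, not any engine's
    actual implementation. The names simple_clusters, canon_equal,
    cluster_match and ANY are hypothetical, and the segmentation is a
    rough approximation (a base character plus its trailing combining
    marks), not the full grapheme cluster rules of UAX #29.

```python
import re
import unicodedata

ANY = None  # hypothetical wildcard standing in for "." in cluster mode


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Assumption: this is a simplified stand-in for real grapheme cluster
    segmentation, good enough to show the idea.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def canon_equal(a, b):
    """Canonical equivalence: two strings are equal once both are in NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def cluster_match(pattern, text):
    """Match a list of pattern units against text, one cluster per unit."""
    units = simple_clusters(unicodedata.normalize("NFD", text))
    if len(units) != len(pattern):
        return False
    return all(p is ANY or canon_equal(p, u) for p, u in zip(pattern, units))


decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Code point mode: "." consumes one code point, so it cannot cover both.
code_point_result = re.fullmatch(".", decomposed)

# Cluster mode: one wildcard covers the whole cluster in either form,
# and a precomposed literal matches its decomposed equivalent.
wildcard_result = cluster_match([ANY], decomposed)
literal_result = cluster_match([precomposed], decomposed)
```

    In code point mode the decomposed form needs two "." to match, while
    in cluster mode both spellings behave the same, which is why
    canonical equivalence is easier to get right when the cluster is
    the unit.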

    > On the comment on "feasible" -- I think the reference there was to
    > language/locale-sensitive regex. That involves a few things which are
    > quite tricky, and are thus listed under Level 3 in UTS#18.
    > * sensitivity: "aa" matches a-ring in Danish
    > * language-sensitive ordering ranges: [a-z] doesn't include o-slash
    > in Danish
    > * language-sensitive grapheme clusters: a dot matches "ch" in Slovak
    > * ...
    > Few implementations try to handle locale-sensitivity except for POSIX
    > (and that has significant problems in it). I wouldn't say that they are
    > infeasible, but they are tricky.

    Lots of programming problems are tricky. If I just gave up every
    time I ran into a tricky problem, my software wouldn't be very

    Being able to match grapheme clusters in regular expressions is
    a requirement for level 2 conformance, so I'm just trying to be
    compliant here. If I can also figure out how to make "." match
    the grapheme clusters a user specifies, such as "ch", what is
    wrong with that?
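
    As a sketch of what that could look like, the Python below lets the
    caller supply extra sequences that should count as a single unit.
    The function names and the "extra" parameter are hypothetical, the
    default segmentation is again just base character plus combining
    marks, and case handling is left to the caller (list "Ch" and "CH"
    separately if they are wanted).

```python
import unicodedata


def tailored_clusters(text, extra=()):
    """Split text into clusters, treating user-specified sequences as units.

    Assumption: greedy longest-match on the extra sequences; everything
    else falls back to base character + combining marks.
    """
    extra = sorted(extra, key=len, reverse=True)  # try longest first
    clusters = []
    i = 0
    while i < len(text):
        for seq in extra:
            if text.startswith(seq, i):
                clusters.append(seq)
                i += len(seq)
                break
        else:
            ch = text[i]
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
            i += 1
    return clusters


def dot_matches(text, extra=()):
    """True if a single cluster-mode "." would consume the whole text."""
    return len(tailored_clusters(text, extra)) == 1


default_units = tailored_clusters("chata")
slovak_units = tailored_clusters("chata", extra=["ch"])
```

    With no tailoring, "chata" splits into five units. With "ch"
    supplied, it splits into four, and dot_matches("ch", extra=["ch"])
    is true.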

