RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (
Date: Mon Oct 01 2007 - 01:01:33 CST

    > cluster, for example. Also, how would you interpret [a\u0300]?
    > As (a|\u0300) or (a\u0300)?

    I would interpret /[a\u0300]/ unambiguously as /(a|\u0300)/ only. To match a
    complete "a" together with its accent:
    * We should not need the "\u" notation as a helper; instead, encode the
    accent directly in the regexp, or use the precomposed character (the two
    forms are canonically equivalent).
    * If this is not possible (because of the input encoding of the regexp),
    then use \q{} to delimit the unbreakable collation element, as in
    /[\q{a\u0300}]/ or simply /\q{a\u0300}/ (an equivalent regexp here).
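
    As a rough illustration of the two readings, here is a minimal sketch in
    Python. Python's re module has no \q{} syntax, so a non-capturing group
    stands in for /\q{a\u0300}/; that substitution, and the variable names, are
    assumptions made for the example only.

```python
import re

GRAVE = "\u0300"          # COMBINING GRAVE ACCENT
text = "a" + GRAVE        # decomposed "a with grave": two code points

# A character class is a set of single code points, so [a\u0300]
# behaves like the alternation (a|\u0300).
char_class = re.compile(r"[a\u0300]")
alternation = re.compile(r"(?:a|\u0300)")
class_hits = char_class.findall(text)
alt_hits = alternation.findall(text)

# Stand-in for \q{a\u0300}: the base letter and its accent as one unit.
sequence = re.compile(r"(?:a\u0300)")
seq_hits = sequence.findall(text)

print(class_hits == alt_hits)   # both split the text into "a" and the accent
print(len(class_hits), len(seq_hits))
```

    The class finds two separate one-character matches, while the grouped
    sequence finds a single match covering the letter with its accent.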

    Note that this simple rule does not break the canonical equivalence of the
    input regexps, whatever their encoding. The \u notation is not an encoding
    but a regexp notation spelled with several characters, so no canonical
    equivalence is implied between a regexp that encodes the characters
    directly and one that uses this notation.
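
    To make the canonical-equivalence point concrete, here is a small Python
    sketch. It assumes an engine that, like Python's re, compares code points
    without applying canonical equivalence by itself; nfc_search is a
    hypothetical helper written for this example, not part of any library.

```python
import re
import unicodedata

precomposed = "\u00e0"    # "a with grave" as one code point
decomposed = "a\u0300"    # "a" followed by COMBINING GRAVE ACCENT

# The two spellings are canonically equivalent: normalization maps
# each one onto the other.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
nfd_of_precomposed = unicodedata.normalize("NFD", precomposed)

# Without normalization, a literal pattern in one form does not
# match text written in the other form.
raw_match = re.search(re.escape(precomposed), decomposed)

def nfc_search(literal, text):
    """Hypothetical helper: normalize both the literal and the text to
    NFC, then search for the literal in the text."""
    pattern = re.escape(unicodedata.normalize("NFC", literal))
    return re.search(pattern, unicodedata.normalize("NFC", text))

normalized_match = nfc_search(decomposed, precomposed)
print(raw_match is None, normalized_match is not None)
```

    Normalizing both sides first is only one possible way to get matching that
    respects canonical equivalence for directly encoded characters.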
