RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (
Date: Tue Sep 25 2007 - 16:48:17 CDT

    > On behalf of Marion Gunn
    > Sent: Tuesday, 25 September 2007 14:11
    > To: Unicode Discussion
    > Cc: Mike; Unicode Mailing List
    > Subject: Re: New Public Review Issue: Proposed Update UTS #18
    > Tricky? Perhaps so, Mark, but solutions are the name of the game. In
    > any case, we need to add CH as a single letter in both Welsh and
    > Breton, C'H as a single letter in Breton, FF as a single letter in
    > Welsh, NG as a single letter in Welsh, etc., in all implementations.
    > mg

    It should be noted that "C'H" is a single collation element in Breton, but
    it is not the only representation of that element: the element also covers
    other representations of the apostrophe, notably "C’H", which collates
    equivalently (there is no primary or secondary difference, only the
    "last chance" difference at the final level, which compares code
    points).
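
    To make the levels concrete, here is a minimal sketch in Python. It is an
    illustration only: the function name breton_sort_key, the set of apostrophe
    variants, and the two-level key are assumptions, not real CLDR or DUCET
    data, and case is folded for simplicity.

```python
# Hypothetical tailoring for illustration only; not real CLDR or DUCET data.
APOSTROPHES = ("\u0027", "\u2019", "\u02BC")  # assumed apostrophe variants


def breton_sort_key(s):
    """Return (primary, tie_break), treating c + apostrophe + h as one element."""
    primary = []
    i = 0
    while i < len(s):
        if (s[i] in "cC" and i + 2 < len(s)
                and s[i + 1] in APOSTROPHES and s[i + 2] in "hH"):
            primary.append("c'h")  # one primary weight for every variant
            i += 3
        else:
            primary.append(s[i].lower())
            i += 1
    # "Last chance" level: raw code points, consulted only to break ties.
    return (tuple(primary), tuple(ord(ch) for ch in s))


a = breton_sort_key("marc'h")
b = breton_sort_key("marc\u2019h")
print(a[0] == b[0])  # primary keys are compared here
print(a == b)        # full keys, including the code point tie-break
```

    The two spellings share a primary key and differ only in the final
    component, which is the situation described above.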

    One question remains here for use in regexps: if the input universe is
    defined in such a way that "." matches a single collation element in the
    input locale or in the current locale context, then it will match several
    distinct strings that are not necessarily canonically equivalent (in
    addition to possibly distinct but canonically equivalent strings).
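
    A rough sketch of what such an input universe could look like, in Python.
    The tokenizer collation_elements and its apostrophe list are hypothetical;
    a real engine would take its contractions from the locale's collation data.

```python
# Assumed apostrophe variants; a real tailoring may list others.
APOSTROPHES = "\u0027\u2019\u02BC"


def collation_elements(text):
    """Split text into units, keeping c + apostrophe + h together."""
    units = []
    i = 0
    while i < len(text):
        if (text[i] in "cC" and i + 2 < len(text)
                and text[i + 1] in APOSTROPHES and text[i + 2] in "hH"):
            units.append(text[i:i + 3])  # one unit, original spelling kept
            i += 3
        else:
            units.append(text[i])
            i += 1
    return units


# Each unit is what "." would consume in such an engine.
print(collation_elements("marc'h"))
print(collation_elements("marc\u2019h"))
```

    Both inputs yield four units, and the last unit differs between them even
    though it is the same collation element: these are the distinct,
    non-canonically-equivalent strings that "." would match.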

    So how can a regexp specify that only one specific form of this collation
    element should match? Suppose that a regexp user wants to find all
    occurrences in a Breton text where "c'h" is used instead of the recommended
    "c’h" form (which is linguistically equivalent).

    This is not restricted to Breton: one could do the same thing with "'" and
    "’" as used in English or French, where they are also perceived as
    equivalent instances of the same collation element and should be treated
    identically. If we want regexps to be usable in linguistic contexts, then we
    must be able to collate strings correctly according to languages, and then
    be able to make distinctions only when this is *explicitly* specified in the

    For this, we'll need a special escaping mechanism that disables the
    interpretation as collation element classes, but still maintains
    the interpretation as unbreakable elements (not part of the input universe).
    One means is the use of \uxxxx or \Uxxxxxxxx, but this is not the easiest
    way to refer to them; if the collation elements treated identically are not
    canonically equivalent, like the various apostrophes, we should be able to
    escape them by using them directly within the regexp, without using any
    tricky hexadecimal notation: if \q{c'h} is used to refer to the Breton
    collation element, does it match \q{c’h}? If it does not match, then how
    can we simply search for either "c'h" or "c’h" in the Breton locale context
    (which should still be the normal way to look for them)?
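
    For comparison, this is how the hexadecimal route looks today in an engine
    with no locale awareness, here Python's re module (which accepts \uXXXX in
    patterns). The sample text and the choice of U+2019 as the second
    apostrophe are assumptions made for the illustration.

```python
import re

# Sample text with one ASCII apostrophe (U+0027) and one U+2019.
text = "marc'h ha marc\u2019h"

# Exactly one form, selected by code point.
ascii_only = re.findall(r"c\u0027h", text)

# Either form, which has to be spelled out by hand as a character class.
either = re.findall(r"c[\u0027\u2019]h", text)

print(len(ascii_only), len(either))
```

    The first pattern finds one occurrence and the second finds two. Nothing in
    the pattern says "the Breton letter", which is the gap a \q{...} notation
    would fill.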

    So, maybe \q{c'h} will match both "c'h" and "c’h" in the Breton locale
    context, and either one of:

    * \Q{c'h} will match only "c'h" (using \Q instead of \q means that it won't
    use the current locale context where equivalent collation elements are
    recognized, but will only refer to the default collation elements that are
    canonically equivalent).
    * or maybe \Q{C!c'h} where we specify a simpler locale, here the C locale,
    where no collation ever occurs, and where not even the canonical
    equivalences are recognized, unlike \Q{POSIX!c'h} where canonical
    equivalences are possible
    * or \Q{U!c'h} in a "U" locale referencing the default Unicode DUCET where
    canonical equivalences should be recognized, but without impact here on the
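
    The difference between a locale that ignores canonical equivalence and one
    that honours it can be sketched in Python. The function literal_equal and
    the locale labels "C" and "U" are hypothetical stand-ins for the notations
    above; only unicodedata.normalize is a real library call.

```python
import unicodedata


def literal_equal(a, b, locale):
    """Compare two literals under a hypothetical "C" or "U" locale."""
    if locale == "C":
        # No collation and no canonical equivalence: raw code points only.
        return a == b
    # "U": canonically equivalent strings compare equal after normalization.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00e9"   # e with acute, one code point
decomposed = "e\u0301"   # e followed by combining acute accent

print(literal_equal(precomposed, decomposed, "C"))
print(literal_equal(precomposed, decomposed, "U"))
print(literal_equal("c'h", "c\u2019h", "U"))
```

    Canonical equivalence unifies the two spellings of the accented letter but
    not the two apostrophes, since U+0027 and U+2019 are not canonically
    equivalent.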

    This archive was generated by hypermail 2.1.5 : Tue Sep 25 2007 - 16:50:34 CDT