RE: New Public Review Issue: Proposed Update UTS #18

From: Philippe Verdy (
Date: Tue Oct 02 2007 - 13:48:06 CST


    Asmus Freytag wrote:
    > After all, the atomic elements for writing would be the 'c' and 'h', it
    > is only for the purpose of some other text operations that 'ch' are
    > (sometimes) considered a unit.

    You gave an example with the Swedish a with ring above: it is perceived as
    two units (even if they *may* be encoded as a single code point). And this
    radically changes the way a regexp like /[a-z]/ should match: will it match
    the 'a' in 'å' (even when it is encoded as a single precomposed code point),
    or will it treat 'å' only as a collation element that always sorts after 'z'
    (even if it is encoded in the decomposed form)?

    So even something as simple as the regexp /[a-z]/ is not that simple, and a
    simple regexp like /a/ may mean several things.

    This is an example where it will be necessary to make distinctions between
    several classes of regexp matching algorithms, here sorted by complexity:

    * (1) Regexp matchers that only match single code points (because
    they work in a locale context with simple binary order of code points) and
    will therefore consistently ignore any relations existing between successive
    code points, including the canonical equivalences.

            * These regexp matchers can't be said to support Unicode.
            * They are the classical POSIX regexps working in a POSIX or C
    locale.
            * In such a regexp matcher, [a-z] will ALWAYS match the 'a' within "å"
    if it is encoded in decomposed form, but will NEVER match a putative 'a'
    within "å" if it is precomposed.
            * They will not even match the capital A with ring when looking for
    the Ångström symbol.
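
As an illustration of the class (1) behaviour described above, here is a
minimal sketch using Python's standard `re` module, which matches code point
by code point and knows nothing about canonical equivalence. The helper name
`first_match` is only for this example.

```python
import re

precomposed = "\u00e5"     # 'å' as a single code point (U+00E5)
decomposed = "a\u030a"     # 'a' followed by COMBINING RING ABOVE (U+030A)

az = re.compile(r"[a-z]")  # plain binary range U+0061..U+007A

def first_match(text):
    """Return the first substring matched by [a-z], or None."""
    m = az.search(text)
    return m.group(0) if m else None

# The 'a' inside the decomposed form is matched; the precomposed form is not.
print(first_match(decomposed))    # a
print(first_match(precomposed))   # None

# Searching for capital A with ring does not find the Ångström sign,
# although the two are canonically equivalent.
a_with_ring = "\u00c5"            # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom_sign = "\u212b"          # ANGSTROM SIGN
print(re.search(a_with_ring, angstrom_sign))   # None
```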

    * (2) Regexp matchers that try to match according to some relations
    that exist between successive code points; they will adhere to the definition
    of canonical equivalence in Unicode (even if they don't recognize any other
    equivalence).

            * This is the strict minimum needed to be a Unicode-compliant
    regexp matcher.
            * They won't recognize language-specific features, but will work in
    a "Unicode neutral" locale where searches within canonically equivalent
    texts using canonical equivalent regexp will return the same set of matches
    (i.e. the segments of texts that these found matches are covering are
    canonically equivalent, not necessarily equal).
            * They won't need to recognize special case mappings with
    contractions or expansions or with language-dependent mappings.
            * They won't need to recognize collation elements, not even those
    defined in the default Unicode collation element table (DUCET).
            * These regexp matchers will still work based on code points as the
    elementary unit (but with a universe of searches where some codepoints are
    considered equivalent or could have several encodings).
            * In such regexp matchers, [a-z] will NEVER match the 'a' in 'å'
    even if it is encoded in a decomposed form. But they will match any Ångström
    symbol or capital A with ring in texts if the regexp specifies any one of
    these characters (written with the actual code points rather than their
    names), because the regexps are canonically equivalent in their encoded
    forms.
            * To make the distinction, one will need to represent those strings
    in a way where no canonical equivalence can be inferred from the regexp
    itself, for example by using numeric character references or by referencing
    the characters by codepoint.
            * A mere encoding of the regexp using the actual codepoints will
    match any other canonically equivalent substrings.
            * Such a regexp matcher SHOULD provide some syntax or global flag
    to select the behaviour of regexp matchers in class (1) above.
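
One rough way to approximate class (2) on top of a class (1) engine is to
normalize both the pattern and the text to NFC before matching. The sketch
below assumes literal patterns only (normalizing arbitrary regexp syntax is
not safe in general), and the function names are made up for this example.
Match offsets refer to the normalized text, not to the original. It is also
only an approximation: a base letter followed by a combining mark that has no
precomposed form would still be exposed to [a-z].

```python
import re
import unicodedata

def nfc(s):
    """Normalize to NFC so canonically equivalent strings become identical."""
    return unicodedata.normalize("NFC", s)

def search_canonical(literal, text):
    """Search for a literal string up to canonical equivalence (sketch)."""
    return re.search(re.escape(nfc(literal)), nfc(text))

def search_range_az(text):
    """Apply [a-z] to the NFC form, so the 'a' of a decomposed 'å' is hidden."""
    return re.search(r"[a-z]", nfc(text))

# Capital A with ring, A + combining ring, and the Ångström sign all match
# one another, whichever encoding the pattern or the text uses.
print(search_canonical("\u00c5", "x\u212by") is not None)   # True
print(search_canonical("A\u030a", "\u212b") is not None)    # True

# The 'a' of a decomposed 'å' is no longer matched by [a-z].
print(search_range_az("a\u030a"))                           # None
```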

    * (3) More advanced regexp matchers that will follow some linguistic
    constraints derived from common Unicode character properties, and
    will need to recognize advanced case mappings (with contractions or
    expansions) but still in a locale-neutral way.

            * These will still work using code points as their elementary work
    unit.
            * They don't need to support the DUCET or any collation element, or
    to recognize something else than the binary order of code points in the
    ranges specified in [] character classes.
            * Such a regexp matcher SHOULD provide some syntax or global flag
    to select the behaviour of regexp matchers in class (1) or (2) above.
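
To make the "case mappings with expansions" of class (3) concrete, here is a
small sketch built on Python's `str.casefold()`, which applies full,
locale-neutral case folding (so one code point may fold to several). It only
does literal substring search, not full regexp matching, and the names `fold`
and `contains_caseless` are invented for the example.

```python
import unicodedata

def fold(s):
    """NFC-normalize, apply full case folding, and normalize again."""
    folded = unicodedata.normalize("NFC", s).casefold()
    return unicodedata.normalize("NFC", folded)

def contains_caseless(needle, haystack):
    """Locale-neutral caseless substring test (a sketch, not a regexp engine)."""
    return fold(needle) in fold(haystack)

# U+00DF (sharp s) expands to "ss" under full case folding, so an
# uppercase "STRASSE" is found in a text containing "Straße".
print(contains_caseless("STRASSE", "Die Stra\u00dfe"))   # True

# The ligature U+FB01 expands to the two letters "fi".
print(fold("\ufb01"))                                    # fi
```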

    * (4) More advanced regexp matchers that will now work according to
    locale-specific constraints (or equivalences).

            * Their base working unit is not the code point but the collation
    element, which depends on the current locale context.
            * Ranges like [a-z] are interpreted according to the collation table
    and order of that locale.
            * Such a regexp matcher SHOULD provide some syntax or global flag
    to select the behaviour of regexp matchers in class (1) or (2) or
    (3) above.
            * If they support a syntax instead of a global flag for such uses,
    then the same regexp will need to handle those simpler matching rules as
    separate locales, distinct from the default working locale.
            * So there will be several locale contexts used in the regexps, and
    the collation elements or case mappings will depend on the current active
    locale in scope within the regexp.
            * It will eventually be possible to specify regexps with parts
    matched according to one locale, and other parts matched according to
    another locale, providing distinct interpretations of the same input text.
            * In such case, for the same characters in the input text, depending
    on the position in the regexp where they are matched, there may be several
    distinct collation elements, according to the locale in scope within the
    regexp transition graph.
            * This means that such regexp may need to use several unit readers
    working in parallel to provide parallel suites of collation elements, one
    for each locale context in use in the regexp.
            * If the regexp supports capturing groups, the subsegments returned
    for each match should be interpreted according to the locale context in
    which each capturing element is embedded.
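
The class (4) idea of reading the same text as different suites of collation
elements, one per locale, can be sketched with toy tables. Everything below is
hypothetical: `SV_LIKE` and `ES_LIKE` are hand-written orders for illustration,
not real CLDR data, and a real engine would use multilevel collation keys
rather than a single rank.

```python
import unicodedata

LATIN = list("abcdefghijklmnopqrstuvwxyz")

def make_locale(ordered_elements):
    """Map each collation element to its rank in the locale's order."""
    return {element: rank for rank, element in enumerate(ordered_elements)}

# Hypothetical Swedish-like order: å, ä, ö sort after z.
SV_LIKE = make_locale(LATIN + ["\u00e5", "\u00e4", "\u00f6"])
# Hypothetical traditional-Spanish-like order: "ch" is one element after c.
ES_LIKE = make_locale(LATIN[:3] + ["ch"] + LATIN[3:])

def collation_elements(text, locale):
    """Split text into the locale's collation elements, longest match first."""
    text = unicodedata.normalize("NFC", text)
    longest = max(len(element) for element in locale)
    result, pos = [], 0
    while pos < len(text):
        for size in range(longest, 0, -1):
            chunk = text[pos:pos + size]
            if chunk in locale:
                break
        else:
            chunk = text[pos]  # not in the table: pass it through unchanged
        result.append(chunk)
        pos += len(chunk)
    return result

def in_range(element, low, high, locale):
    """Interpret a range like [low-high] by collation rank, not code point."""
    rank = locale.get(element)
    return rank is not None and locale[low] <= rank <= locale[high]

# The same input yields different units depending on the locale in scope.
print(collation_elements("chico", ES_LIKE))   # ['ch', 'i', 'c', 'o']
print(collation_elements("chico", SV_LIKE))   # ['c', 'h', 'i', 'c', 'o']

# In the Swedish-like order, 'å' lies outside [a-z], even when decomposed.
print(collation_elements("a\u030a", SV_LIKE) == ["\u00e5"])   # True
print(in_range("\u00e5", "a", "z", SV_LIKE))                  # False
```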

    * (5) More advanced regexp matchers will allow a regexp to build or extend
    its own locale, by defining specific collation elements and ordering them
    according to other collation elements.

            * It is suggested to use some syntax derived from the one already
    used in the definition of tailored collations (like in the CLDR) if the
    definition of specific collation elements is made within the regexp syntax,
    but there may exist some difficulties (need to escape some parts of these
    collation definitions) in order to avoid collisions with the rest of the
    Regexp syntax.
            * Such modification of the Regexp syntax is not needed if those
    definitions of tailored collations are defined externally, but these
    tailored locales (with specific case mappings for example) and collations
    will need some way to reference them, using a syntax that is compatible with
    the one used in class (4) regexp matchers above for specifying specific
    locales.
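
For class (5), a tailoring could be expressed with rules loosely modelled on
the collation rule syntax used in the CLDR ("&" names an anchor, "<" places
the next element after it). The sketch below handles only that tiny subset,
on a flat list of elements; the function `tailor` is hypothetical and ignores
secondary and tertiary levels as well as any escaping needed to embed such
rules inside a regexp.

```python
def tailor(base_order, rule):
    """Apply one rule of the form '&anchor < e1 < e2 ...' to an element list."""
    if not rule.startswith("&"):
        raise ValueError("rule must start with '&'")
    anchor, *new_elements = [part.strip() for part in rule[1:].split("<")]
    # Remove the elements from any previous position, then re-insert them
    # in order, immediately after the anchor.
    order = [e for e in base_order if e not in new_elements]
    at = order.index(anchor) + 1
    order[at:at] = new_elements
    return order

base = list("abcdefghijklmnopqrstuvwxyz")

# Define "ch" as a collation element of its own, sorting just after "c".
print(tailor(base, "&c < ch")[2:5])                           # ['c', 'ch', 'd']

# Place å, ä, ö after z.
print(tailor(base, "&z < \u00e5 < \u00e4 < \u00f6")[-4:])
```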

    Advanced collation rules (that require more than what the multilevel UCA
    algorithm describes) may be also supported using specific operators or
    syntax (for example if the regexp matcher engine includes some syntax to
    match numbers, and allowing them to be tested or ordered in ranges according
    to their numeric value). These rules could be tailorable and possibly added
    in the regexp syntax, within any of the above classes of regexp engines, but
    this goes too far for this discussion.

    This archive was generated by hypermail 2.1.5 : Tue Oct 02 2007 - 13:51:01 CST