From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 02 2007 - 13:48:06 CST
Asmus Freytag wrote:
> After all, the atomic elements for writing would be the 'c' and 'h', it
> is only for the purpose of some other text operations that 'ch' are
> (sometimes) considered a unit.
You gave the example of the Swedish a with ring above: it is perceived as
two units (even if they *may* be encoded as a single code point). This
radically changes how a regexp like /[a-z]/ should match: will it match
the 'a' in 'å' (even when it is encoded as a single precomposed code point),
or treat 'å' only as a collation element that always sorts after 'z'
(even if it is encoded in decomposed form)?
So even a regexp as basic as /[a-z]/ is not that simple, and one as
short as /a/ may mean several things.
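To make the ambiguity concrete, here is a minimal sketch using Python's
`re` module, which matches on code points only. The helper name
`first_match` is made up for this illustration and is not part of the
original discussion.

```python
import re
import unicodedata

precomposed = "\u00e5"                                   # 'å' as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # 'a' + U+030A

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# A code-point matcher gives different answers for canonically
# equivalent inputs:
print(first_match(r"[a-z]", precomposed))   # None
print(first_match(r"[a-z]", decomposed))    # a

# Likewise, a literal capital A with ring above does not find the
# canonically equivalent U+212B ANGSTROM SIGN:
print(first_match("\u00c5", "\u212b"))      # None
```

The same pattern gives two different results on two encodings of the
same text, which is exactly the inconsistency at issue here.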
This shows that it will be necessary to distinguish several classes of
regexp matching algorithms, listed here in order of increasing complexity:
* (1) Regexp matchers that match only single code points (because they
  work in a locale context with a simple binary order of code points) and
  therefore consistently ignore any relation between successive code
  points, including canonical equivalence.
    * These regexp matchers cannot be said to support Unicode.
    * They are the classical POSIX regexps working in a POSIX or C
      locale.
    * In such a regexp matcher, [a-z] will ALWAYS match the 'a' within
      "å" if it is encoded in decomposed form, but will NEVER match a
      putative 'a' within "å" if it is precomposed.
    * They will not even match the capital A with ring above when
      looking for the Ångström sign (U+212B).
* (2) Regexp matchers that match according to some relations between
  successive code points: they adhere to the Unicode definition of
  canonical equivalence (even if they recognize no other relation).
    * This is the strict minimum needed to be a Unicode-compliant
      process.
    * They won't recognize language-specific features, but will work in
      a "Unicode neutral" locale where searching canonically equivalent
      texts with canonically equivalent regexps returns the same set of
      matches (i.e. the text segments covered by these matches are
      canonically equivalent, not necessarily equal).
    * They won't need to recognize special case mappings with
      contractions or expansions, or language-dependent mappings.
    * They won't need to recognize collation elements, not even those
      defined in the Default Unicode Collation Element Table (DUCET).
    * These regexp matchers still use the code point as the elementary
      unit (but within a universe of searches where some code points are
      considered equivalent or may have several encodings).
    * In such regexp matchers, [a-z] will NEVER match the 'a' in 'å',
      even if it is encoded in decomposed form. But they will match any
      Ångström sign or capital A with ring above in texts if the regexp
      specifies any one of /<A><COMBINING RING ABOVE>/,
      /<A WITH RING ABOVE>/ or /<ANGSTROM SIGN>/ (replace the names here
      by the actual code points), because these regexps are canonically
      equivalent in their encoded forms.
    * To make the distinction, one will need to represent those strings
      in a way where no canonical equivalence can be inferred from the
      regexp itself, for example by using numeric character references
      or by referencing the characters by code point.
    * A mere encoding of the regexp using the actual code points will
      match any other canonically equivalent substring.
    * Such a regexp matcher SHOULD provide some syntax or a global flag
      to request the behaviour of class (1) matchers above.
* (3) More advanced regexp matchers that apply some linguistic
  constraints based on common Unicode character properties, and that need
  to recognize advanced case mappings (with contractions or expansions),
  but still in a locale-neutral way.
    * These still use the code point as their elementary working unit.
    * They don't need to support the DUCET or any collation element, or
      to recognize anything other than the binary order of code points
      in the ranges specified in [] character classes.
    * Such a regexp matcher SHOULD provide some syntax or a global flag
      to request the behaviour of class (1) or (2) matchers above.
* (4) More advanced regexp matchers that work according to
  locale-specific constraints (or equivalences).
    * Their base working unit is not the code point but the collation
      element, which depends on the current locale context.
    * Ranges like [a-z] are interpreted according to the collation table
      and order of that locale.
    * Such a regexp matcher SHOULD provide some syntax or a global flag
      to request the behaviour of class (1), (2) or (3) matchers above.
    * If they support a syntax instead of a global flag for such uses,
      then the same regexp will need to handle those simpler matching
      rules as separate locales, distinct from the default working
      locale.
    * So several locale contexts will be used in a regexp, and the
      collation elements or case mappings will depend on the locale
      currently in scope within the regexp.
    * It will eventually be possible to specify regexps with some parts
      matched according to one locale and other parts matched according
      to another locale, giving distinct interpretations of the same
      input text.
    * In that case, the same characters in the input text may yield
      several distinct collation elements, depending on the position in
      the regexp where they are matched and on the locale in scope at
      that point of the regexp transition graph.
    * This means that such a regexp may need several unit readers
      working in parallel to provide parallel sequences of collation
      elements, one for each locale context used in the regexp.
    * If the regexp supports capturing groups, the subsegments returned
      for each match should be interpreted according to the locale
      context in which each capturing element is embedded.
* (5) More advanced regexp matchers that allow a regexp to build or extend
  its own locale, by defining specific collation elements and ordering
  them relative to other collation elements.
    * If specific collation elements are defined within the regexp
      syntax, it is suggested to use a syntax derived from the one
      already used to define tailored collations (as in the CLDR), but
      there may be some difficulties (parts of these collation
      definitions need escaping) in avoiding collisions with the rest of
      the regexp syntax.
    * No such modification of the regexp syntax is needed if the
      tailored collations are defined externally, but these tailored
      locales (with specific case mappings, for example) and collations
      will need some way to be referenced, using a syntax compatible
      with the one used in class (4) regexp matchers above for
      specifying specific locales.
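The difference between classes (2) and (4) and a plain code-point
matcher can be sketched as follows. This is only an illustration under
stated assumptions, not a real regexp engine: the function names
(`units`, `find_literal`, `in_range`) and both collation orders are made
up for the example, and a "unit" is approximated as a code point plus the
following code points with a non-zero canonical combining class.

```python
import string
import unicodedata

def units(text):
    """Split NFD-normalized text into units: one code point plus the
    combining marks that follow it (approximation, see lead-in)."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.combining(ch):
            out[-1] += ch          # attach the mark to the previous unit
        else:
            out.append(ch)         # start a new unit
    return out

def find_literal(needle, haystack):
    """Class (2) style: positions (in units) where needle occurs,
    compared under canonical equivalence and on unit boundaries."""
    n, h = units(needle), units(haystack)
    return [i for i in range(len(h) - len(n) + 1) if h[i:i + len(n)] == n]

# Class (4) style: a range is tested against a collation order.
# Both orders are toy tables invented for this sketch, not CLDR data.
LETTERS = list(string.ascii_lowercase)
SWEDISH_LIKE = LETTERS + ["a\u030a", "a\u0308", "o\u0308"]     # å ä ö after z
ACCENT_NEAR_BASE = ["a", "a\u030a", "a\u0308"] + LETTERS[1:]   # å ä next to a

def in_range(unit, lo, hi, order):
    """True if unit sorts between lo and hi in the given order."""
    rank = {u: i for i, u in enumerate(order)}
    return unit in rank and rank[lo] <= rank[unit] <= rank[hi]

a_ring = units("\u00e5")[0]                           # same unit for both encodings
print(find_literal("\u212b", "x\u00c5y"))             # [1]
print(find_literal("a", "\u00e5"))                    # []
print(in_range(a_ring, "a", "z", SWEDISH_LIKE))       # False
print(in_range(a_ring, "a", "z", ACCENT_NEAR_BASE))   # True
```

The last two lines show the same range giving opposite answers for the
same input, depending only on which order is in scope.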
Advanced collation rules (requiring more than what the multilevel UCA
algorithm describes) may also be supported using specific operators or
syntax (for example, if the regexp engine includes some syntax to match
numbers and allows them to be tested or ordered in ranges according to
their numeric value). These rules could be tailorable and possibly added
to the regexp syntax in any of the above classes of regexp engines, but
this goes too far for this discussion.
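As a rough illustration of the numeric-value idea only, the sketch below
emulates such a range test outside the regexp syntax: Python's `re` finds
runs of decimal digits and a separate predicate compares their numeric
value. The function name `find_numbers_in_range` is hypothetical, and no
existing engine syntax is implied.

```python
import re

def find_numbers_in_range(text, lo, hi):
    """Return the digit runs in text whose numeric value lies in [lo, hi].

    For str patterns, \\d matches any Unicode decimal digit, and int()
    accepts those digits, so non-ASCII digits are compared by value too.
    """
    return [m.group(0)
            for m in re.finditer(r"\d+", text)
            if lo <= int(m.group(0)) <= hi]

print(find_numbers_in_range("rooms 7, 42 and 128", 10, 99))   # ['42']
```

A binary code-point range such as /[1-9][0-9]/ would give the same
answer here, but only for ASCII digits and only for ranges that happen
to align with digit positions.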