Re: Regular Expressions and Canonical Equivalence from Philippe Verdy on 2015-05-14 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Fri, 15 May 2015 02:38:17 +0200

2015-05-14 20:13 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> If the interval list is compacted, at most one of the intervals will
> contain a character properly having combining class 0.

This is not a sufficient condition: there is also the case where two
intervals contain combining characters with the same combining class. Their
relative order is significant because one blocks the other (it limits
the allowed reorderings that are canonically equivalent).
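
A minimal sketch of the blocking effect, using Python's standard
`unicodedata` module. The helper name `canonically_equivalent` and the
choice of marks are illustrative assumptions, not part of the thread: two
marks with different combining classes may be swapped freely, while two
marks with the same class may not.

```python
import unicodedata

DOT_BELOW = "\u0323"  # combining class 220
ACUTE = "\u0301"      # combining class 230
GRAVE = "\u0300"      # combining class 230 (same class as ACUTE)

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Different combining classes: the two orders are canonically equivalent.
different_classes = canonically_equivalent(
    "a" + ACUTE + DOT_BELOW, "a" + DOT_BELOW + ACUTE)

# Same combining class: each mark blocks the other, so the order matters.
same_class = canonically_equivalent(
    "a" + ACUTE + GRAVE, "a" + GRAVE + ACUTE)

print(different_classes, same_class)
```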

But if the replacement string also adds its own blockers, the situation is
worse...
There's no simple way to determine what to do by just returning a
replacement string that the regexp engine will itself insert into the output
text. The best that can be done is for the regexp engine to give a full view
not only of the characters within matches, but also of the characters in the
middle that are not part of the match. Instead of letting the engine perform
the insertion itself (by specifying a single expression for the replacement
text), you provide a callback function that also analyses the non-matched
characters in the middle to decide what to do with them. You should then be
able to choose between several replacement patterns, including placeholders
for the unmatched intervals, such as numbered placeholders with negative
values ($-1, $-2, ...), with zero and positive numbers being used for the
classical array of matched captures ($0, $1, ...). But for these additional
captures that are not part of the match, you need a way to indicate their
placement within the true matched captures: not all positive captures share
the same set of negative captures, nor at the same positions.
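
The callback idea above could be sketched as follows. This is only an
illustration under stated assumptions: no existing regexp engine is known
to offer this interface, the names `split_captures` and
`replace_discontiguous` are hypothetical, and the matched intervals are
supplied by hand rather than found by a real canonical-equivalence matcher.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) offsets into the text

def split_captures(text: str, matched: List[Interval]):
    """Return (positive, negative): the matched pieces, and the unmatched
    gaps lying between consecutive matched pieces."""
    matched = sorted(matched)
    positive = [text[s:e] for s, e in matched]
    negative = [text[matched[i][1]:matched[i + 1][0]]
                for i in range(len(matched) - 1)]
    return positive, negative

def replace_discontiguous(
        text: str,
        matched: List[Interval],
        callback: Callable[[List[str], List[str]], str]) -> str:
    """Replace the whole span covered by a discontiguous match with whatever
    the callback builds from the positive and negative captures."""
    matched = sorted(matched)
    positive, negative = split_captures(text, matched)
    start, end = matched[0][0], matched[-1][1]
    return text[:start] + callback(positive, negative) + text[end:]

# Hypothetical search for "a + acute" in NFD text "x a dot-below acute y":
# the dot below (U+0323) sits between the two matched pieces.
text = "xa\u0323\u0301y"
matched = [(1, 2), (3, 4)]

# The callback replaces the match with "e" and keeps the unmatched mark.
result = replace_discontiguous(
    text, matched, lambda pos, neg: "e" + "".join(neg))
```

Here the callback, not the engine, decides where the unmatched dot below
goes relative to the replacement text.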

Note that to make sure we can perform safe replacements within
normalized text, and that the result will also be normalized, the negative
captures need to include not only characters in the middle of a match, but
also all the other combining characters with non-zero combining class that
are before the matched string (if the matched string does not start with a
character with combining class 0), and those after it that have a higher
combining class than the last character in the positive capture. If the
positive capture is an empty string, the first negative capture will include
all combining characters with distinct non-zero combining classes before the
insertion point of that empty positive capture, and the second one will
include all combining characters after the insertion point that have
distinct non-zero combining classes. (These two negative captures are
bounded in length to at most 255 characters, just like the negative captures
added for parts of the input that are in the middle of a positive capture.)
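
A rough sketch of the empty-match case, assuming NFD text and Python's
`unicodedata` module. The function names are hypothetical, and the sketch
simplifies the description above: it collects every adjacent non-zero-class
character on each side rather than filtering by combining class, and it
does not enforce the 255-character bound.

```python
import unicodedata

def surrounding_marks(text: str, pos: int):
    """Return (before, after): the runs of characters with non-zero
    combining class immediately before and after the insertion point."""
    start = pos
    while start > 0 and unicodedata.combining(text[start - 1]) != 0:
        start -= 1
    end = pos
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[start:pos], text[pos:end]

def insert_normalized(text: str, pos: int, insertion: str) -> str:
    """Insert into NFD text and renormalize only the affected run."""
    before, after = surrounding_marks(text, pos)
    start, end = pos - len(before), pos + len(after)
    middle = unicodedata.normalize("NFD", before + insertion + after)
    return text[:start] + middle + text[end:]

text = "a\u0323\u0301"   # a, dot below (220), acute (230): already NFD
cedilla = "\u0327"       # combining class 202

naive = text + cedilla                      # no longer in canonical order
safe = insert_normalized(text, len(text), cedilla)
```

The naive result is canonically equivalent to the safe one but is not
itself normalized, which is why the surrounding marks have to be visible to
whoever builds the replacement.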

So far I have never seen any regexp engine supporting the concept of
"negative captures": all of them only return positive ones, even when
they allow the replacement to be a callback and not just a static string
with optional placeholders.

> If there is such an interval, it will be
> replaced and the others simply deleted. If there is no such interval,
> then the choice of insertion point may be more difficult. Indeed, in
> some cases, it could be appropriate to reject the replacement command
> as undefined in the context. On the other hand, if the text buffer is
> normalised, then one would be able to have well-defined behaviour, as
> one does when splitting text into UCA collating elements.
>
> Richard.
>
Received on Thu May 14 2015 - 19:40:04 CDT
