Re: Regular Expressions and Canonical Equivalence from Philippe Verdy on 2015-05-14 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 14 May 2015 12:58:29 +0200

2015-05-14 9:59 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> An elegant formal solution to the Kleene star problem interprets
> (\u0323\u0302)* as (\u0323|\u0302)*. However, that is
> counter-intuitive

Yes it is problematic: (ab)* is not the same as (a|b)* as this requires
matching pairs of letters "ab" in that order in the first expression, but
random strings of "a" and "b" i nthe second one (so the second matches
*more* input samples.

Even if you consider canonical equivalences (where the relative order of
"ab" does not matter for example because they have distinct non-zero
canonical) this does not mean that "a" alone will match in the first
expression "(ab*)", even though it MUST match in "(a|b)*".

So the solution is just elegant to simplify the first level of analysis of
"(ab)*" by using "(a|b)*" instead. But then you need to perform a second
pass on the match to make sure it is containing only complete sequences
"ab" in that order (or any other order if they are all combining with a
non-zero combining class) and no unpaired "a" or "b".

Such transform using two passes should only be made when subregexps within
a "(...)*" contain only alternatives (converted to NFD) such then each of
them contains ONLY combining characters with distinct non-zero combining
classes. If one of the alternatives "ab" contains any character with
combining class 0 or if they have blockers with identical non-zero
combining classes, we cannot use this transform.

But this transform using two passes is stil elegant: the alternatives where
we can use it and that requires a second pass have a bounded length (it's
impossible for them to be longer than 255 codepoints given there cannot be
more than 255 *distinct* non-zero combining classes. But even in this case,
the current UCD currently uses a much lower number of non-zero combining
classes, so this limit is even lower: the substrings where this transform
is possible will be extremely short and a second pass on them will be
extremely fast (using very small string buffers that can stay in memory).

For your example "(\u0323\u0302)*" the characters in the alternatives
(COMBINING DOT BELOW and COMBINING ACUTE ACCENT), once converted to NFD
(which is the same here) are just using at most two distinct non-zero
combining classes and no blocker; so it is safe to trasform it to
(\u0323|\u0302)* for a first pass matching that will then only check
candidate matchs in the second pass. or more efficiently, a second finite
state automata (FSA) running in parallel with its own state:

in your example this second FSA just has 2 states: the initial state 0
which is also the final/accept state, and state 1 after matching one
character of the pair. When you reach the point where matching (\u0323 |
\u0302)* with the first level of analysis would terminate, you just need to
check the state of the 2nd FSA to see if it is also in the
initial/final/accept state 0 (otherwise this is not an valid accept state
for the untransformed (\u0323\u0302)* regexp.

However, the most difficult part for regexps supporting canonical
equivalence is about what we can do to return submatches! they are not
necessarily contiguous in the input stream. You can still return a matching
substring but if you use it for performing search/replace operations, it
becomes difficult to know where to place the replacement, when that
replacement string (even if it was converted first to NFD) may also contain
combining characters. Or even worse if the replacement contains some
blockers that will be inserted in the middle of the non-replaced text (wnad
where can we safely place the remaining characters in the middle of the
match but that are not part of the match itself ???

One solution is to not exclude these characters in the middle of a match
and return them too. It's up to the replacement function to check their
existence:

The regexp engine can just provide an additional index of characters in the
returned matched substring, that are in fact not part of the actual match
but present in the middle, instead of just the substring for the match.

Or it can return just the exact matching substring, but also an index array
containing their relative position in the actual input string (in standard
matches those indexes would be the sequence of intergers 0 to N-1 where N
is the length of the matched substring, but if the sequence is
discontinusous in the input, the sequence will still be growing with some
steps higher than 1, leaving some holes (the last index in that sequence
will be equal to or higher than N).
Received on Thu May 14 2015 - 06:00:03 CDT

This archive was generated by hypermail 2.2.0 : Thu May 14 2015 - 06:00:04 CDT