Re: Regexes, Canonical Equivalence and Backtracking of Input from Philippe Verdy on 2015-05-18 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 19 May 2015 01:25:54 +0200

I don't work with strings, but with what you seem to call "traces", but
that I call sets of states (they are in fact bitsets, which may be
compacted or just stored as arrays of bytes containing just 1 usefull bit,
but which may be a bit faster; byte arrays are just simpler to program).,
in a stack (I'll use bitsets later to make the structure more compact, if
needed, but for now this is fast enough and not memory intensive even for
large regexps with many repetitions with "+/*/{m,n}" or variable parts).
The internal matcher uses NFD, but needs to track the positions in the
original buffered input for returning captured matches.

There's some optiomization to reduce the size of the bitsets, by defining
classes. The representation of classes in Unicode is more challenging than
with plain ASCII or ISO8859-*, for this reason I limit their length
(differences between the smallest and highest code point), and over this
size the classes are just defined as a sorted string of pairs of
codepoints: I can perform a binary search in that string and look at the
position: with an even position the character is part of the class, with an
odd position, the character is not part of it).

Thanks to a previous message you posted, I noted that my code deos not work
correctly with Hangul precomposed syllables (I perform the decompoisition
to NFD of the input on the fly in the input buffer, but the buffer is
incorrectly advanced when there's a match to the next character, and it can
skip one or two characters of the original input instead of code points in
the NFD transformed input. (I don't have extensive cases for testing
Hangul, I have much more for Latin, Greek, Cyrillic and Arabic, but also
too few for Hebrew where "pathological" cases of regexps are certainly more
likely to occur than in Latin, even with Vietnamese and its frequent double
diacritics).

For now with the complex cases of replacements, I have no precise syntax
defined for specifiying replacements as as simple string with placeholders
I just allow these matches to be passed as objects (rather than just
strings) to a callback that performs the substitutions itself using the
array of captures given by the engine to the callback; I have no idea for
now about how to handle the special cases occuring when computing the
actual replacements:

The callback can insert/delete subsequences everywhere in the input buffer
which is limited in size by the extent of $0, plus any intermediate
characters when there's a discontinuity, plus their left and right contexts
when the match still does not include the full combining sequences (for
most uses cases, the left context is empty, but the right context is
frequently non-empty and contains all combining characters on over the last
base which is part of the match; the callback also does not necessarily
have to modify the input buffer it it does not want to perform replacements
in it, but in that case the input buffer is readonly and I don't need to
feed the contexts which remain empty. There are also left and right context
variables for *each* capture group (some of them may be partly or fully in
another returned capture group).

Finally a question:

I suppose that like many programmers you have read the famous "green
dragon" book of Sethi/Aho/Ullman books about compilers. I can understand
the terminology they use when spoeaking about automatas (and that is found
in many other places), but apparently you are using some terms that I have
to guess from their context.
Good books on the subjext are now becoming difficutlt to find (or they are
more expensive now), and too difficult to use on the web (for such very
technical topics, it really helps to have a printed copy, that you an
annotate, explore, or have beside you instead of on a screen (and printing
ebooks is not an option if they are voluminous). May be you have other
books to recommend. But finding these books in libraries is now becoming
difficult when many are closing or reducing their proposed collections (and
I don't like buying books on the Internet). For the rest, I tend to just
describe what I've made or used or experimented, even if the terms are not
the best ones (some of my references are in French, and dificutl to
translate).

On difficult topics like this one, I'm not paid to perform research and I
can only do that in my spare time from time to time, until I can make
something stable enough for a limited use (without experimental features)
In the past I could work on such research topic, but now we are pressed to
use extisting libraries and not pass lot of time, we sell smaller
incremental but limtied improvements and we know what is volutarily limited
and left unimplemented.

2015-05-18 23:14 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> On Mon, 18 May 2015 22:56:47 +0200
> Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
> > Isn't it possible for your basic substitution to transform \uf073
> > into a character class [\uf071\uf072\uf073] that the regexp considers
> > as a single entity to check ?
> > In that case, backtracking for matching \u0F73*\u0F72 is simpler:
> > [\uF071\uF072\uF073]*\u0F72, as it just requires backtracking only
> > one character class (instead of one character).
>
> I'm still waiting for your explanation of how your scheme for European
> diacritics (as used in SE Asia) would work. This thread is intended for
> the idea of using the regex to decide which character to take as the
> next character from the input trace. In the other thread, I'm still not
> sure whether you're working with traces or strings.
>
> Richard.
>
Received on Mon May 18 2015 - 18:27:29 CDT

This archive was generated by hypermail 2.2.0 : Mon May 18 2015 - 18:27:30 CDT