From: Richard Wordingham <richard.wordingham_at_ntlworld.com>

Date: Fri, 15 May 2015 08:10:03 +0100

Date: Fri, 15 May 2015 08:10:03 +0100

On Fri, 15 May 2015 02:10:36 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

*> 2015-05-14 20:13 GMT+02:00 Richard Wordingham <
*

*> richard.wordingham_at_ntlworld.com>:
*

*>
*

*> > On Thu, 14 May 2015 12:58:29 +0200
*

*> > Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
*

*> >
*

*> > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham <
*

*> > > richard.wordingham_at_ntlworld.com>:
*

*> > >
*

*> > > > An elegant formal solution to the Kleene star problem interprets
*

*> > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is
*

*> > > > counter-intuitive
*

*> >
*

*> > The technical term for this is the 'concurrent iteration' - or at
*

*> > least, that's the term used in the 'Book of Traces'.
*

*> >
*

*> > > For your example "(\u0323\u0302)*" the characters in the
*

*> > > alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT),
*

*> > > once converted to NFD (which is the same here) are just using at
*

*> > > most two distinct non-zero combining classes and no blocker; so
*

*> > > it is safe to trasform it to (\u0323|\u0302)* for a first pass
*

*> > > matching that will then only check candidate matchs in the second
*

*> > > pass. or more efficiently, a second finite state automata (FSA)
*

*> > > running in parallel with its own state:
*

*> >
*

*> > You've forgotten the basic problem. A *finite* state automaton
*

*> > cannot count very far; with only n states, it cannot count as far
*

*> > as n.
*

*> >
*

*>
*

*> I did not forget it, this is why there's a second pass (or a second
*

*> FSA running in parallel to indicate its own accept state). You have
*

*> to combine the two states variables to get the final combined state
*

*> to determine if it is a final accept state.
*

Your description makes no sense to me as a description of a finite

state automaton. Now, a program to check whether a trace matching

{\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just

counts the number of times \u0323 occurs and the number of times

\u0302 occurs, and returns whether they are equal. The two counters

are the key variables (and one could just keep the difference in the

counts). However, this is not a finite state automaton.

Now, to some extent I am cheating by assuming that the characters are

delivered in NFD order. If I did not do this, to construct the

non-deterministic finite automaton (NDFA) for the concatenation of two

sets / regular expressions, the triples (x, y, n) of (left NDFA state,

right NDFA state, highest non-zero ccc assigned to righthand component)

would need to be expanded. The third component would become a list of

non-zero ccc's - in principle 2^254 values, but in fact rather fewer as

not all 255 ccc values are used by Unicode. It is still finite. I

prefer to keep the complexity out of the regular expression engine

proper.

Given a NDFA recognising a set of NFD strings, one can convert it to a

deterministic finite automaton (DFA), say X, provided one does not run

out of memory or time. One can then 'easily' construct a DFA Y

recognising the canonical equivalents of the strings. The state in DFA

Y reached by string x is defined to be the state reached by the string

to_NFD(x) in DFA X. This method relies on the identity

to_NFD(to_NFD(x)z) = to_NFD(xz).

This handles the recognition of a string canonically equivalent to

\u0323\u0302. (The constructions above are sledge hammers; the NDFAs

have many unreachable states.) However, recognising canonical

equivalents of (\u0323\u0302)* via an FSM is rather more difficult; to

be precise, it cannot be done by an FSM.

Richard.

Received on Fri May 15 2015 - 02:11:16 CDT

*
This archive was generated by hypermail 2.2.0
: Fri May 15 2015 - 02:11:16 CDT
*