From: Richard Wordingham <richard.wordingham_at_ntlworld.com>

Date: Mon, 18 May 2015 01:03:02 +0100

Date: Mon, 18 May 2015 01:03:02 +0100

On Sun, 17 May 2015 16:33:15 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

*> 2015-05-16 22:33 GMT+02:00 Richard Wordingham <
*

*> richard.wordingham_at_ntlworld.com>:
*

*>
*

*> > On Sat, 16 May 2015 18:29:18 +0200
*

*> > Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
*

*> >
*

*> > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham <
*

*> > > richard.wordingham_at_ntlworld.com>:
*

*> > >
*

*> > > > There is an annoying error. You appear to assume that U+0302
*

*> > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute,
*

*> > > > but they don't; they have the same combining class, namely
*

*> > > > 230. I'm going to assume that 0303 is a typo for 0323.
*

*> > >
*

*> > >
*

*> > > Not a typo, and I did not made the assumption you suppose because
*

*> > > I chose then so that they were effectively using the **same**
*

*> > > combining class, so that they do not commute.
*

*> >
*

*> > In that case you have an even worse problem. Neither the trace nor
*

*> > the string \u0303\u0302\u0302 matches the pattern
*

*> > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match
*

*> > the regular expression
*

*> > (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\
*

*> > 0303|˕\0303˕\u0302)*˕\u0302˕
*

*> >
*

*> > You've transformed (\u0302\u0303) into
*

*> > (˕\u0302˕\0303|˕\0303˕\u0302), but that is unnecessary and wrong,
*

*> > because U+0302 and U+0303 do not commute.
*

*>
*

*>
*

*> Oh right! Thanks for pointing, it was intended you can read it as.
*

*>
*

*> (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\0303)*˕\u0302˕
*

*>
*

*> But my argument remains because of the presence of \0302 in the second
*

*> subregexp (which additionally is a separate capture, but here I'm not
*

*> concentrating on the impact in numbered captures, but only on the
*

*> global capture aka $0)
*

*>
*

*>
*

*> > > It was the key fact of my argument that destroys your
*

*> > > argumentation.
*

*> >
*

*> > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
*

*> > corrected new regex
*

*> >
*

*> > (˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕
*

*> >
*

*> > Do you claim that this argument is destroyed? If it is irrelevant,
*

*> > why is it irrelevant? It shows that your transform does not solve
*

*> > the original problem of missed matches.
*

*> >
*

*>
*

*> Why doesn't it solve it?
*

Sorry, my example wasn't quite right. It should have two combining

dots below and five circumflexes, not four as I wrote it. I will

first explain how my NDnear-FA handles it - I have now removed the

generation of the dead end states.

Initial states:

0) LLLL0 # Starting the \u0302\u0302\u0323 factor,

# implemented as \u0323\u032\u0320

1) LLRM # Completed the zero trip alternative to (\u0302\u0302\u0323)+

# Not actually useful.

2) LRLL0 # Starting the \u0302\u0303 factor

3) LRRM # Completed the zero trip alternative to (\u0302\u0303)+

4) R0 # Starting the \u0302 factor

=0323=00:06:=

LLLL0 => LLLL2 # \u0323\u0302\u0302 factor progressed as far as \u0323

=0323=06:012:=

LLLL2 => LLLN001220:2:L2 # Progressing 2 successive repeats of factor.

# Both have progressed as far as \u0323.

# Finiteness would restrict me to, say, 3 repeats

# in progress.

# The states of the finite DFA are a cross product of 3 copies of

# the DFAs for \u0323\u0302\u0302 and 2 copies of the set of relevant

# ccc values. By no means all of these states are used.

# In the Kleene stars of the regular expression guaranteed by

# recognisability, 3 copies caters for the worst case, xyz, where x

# has a starter and ends in a non-starter, y consists of non-starters

# with the same canonical combining class, and z starts with

# non-starter and contains a starter, e.g.

# x = \u0f40\u0f74, y = \u0f7a\u0f7a\u0f7a, z = \u0f71\u0f42

# to_NFD(xyz) = \u0f40\u0f71\u0f7a\u0f7a\u0f7a\u0f74\u0f42

=0302=012:018:=

LLLN001220:2:L2 => LLLN001220:4:L2 # Still progressing two factors

# First has progressed to \u0323\u0302 and second to

# \u0323. The other way round has been pruned by the

# automated observation that if \u0302 is blocked from

# first factor, the factor cannot be completed.

=0302=018:024:=

LLLN001220:4:L2 => LLLN001220:M:L2 # Completed the first factor

LLLN001220:4:L2 => LLLL2 # As first factor is complete, remove it from

# consideration and relabel second factor as

# first.

=0302=024:030:=

LLLL2 => LLLL4 # \u0323\0302\u0302 completed as far as \u0323\u0302

=0302=030:036:=

LLLL4 => LLLLM # \u0323\u0302\u0302 is complete.

LLLL4 => LRLL0 # So start \u0302\u0303 factor.

LLLL4 => LRRM # Alternatively, completed the zero trip option of

# (\u0302\u0303)*

LLLL4 => R0 # Or, we have progressed as far as the final \u0302

LLLL4 => LLLL0 # Or, start another \u0323\u0302\u0302

=0302=036:042:=

LRLL0 => LRLL2 # Got as far as \u0302 in \u0302\u0303

R0 => RM (match) # Or completed the final \u0302.

End marker is at 042:OVF

Could you please talk me through how your system recognises the string

\u0323\u0323\u0302\u0302\u0302\u0302\u0302 as matching the regex. I

can't work out how it is supposed to work from your description.

Richard.

Received on Sun May 17 2015 - 19:04:41 CDT

*
This archive was generated by hypermail 2.2.0
: Sun May 17 2015 - 19:04:41 CDT
*