Re: Internationalised Computer Science Exercises from Philippe Verdy via Unicode on 2018-01-29 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Mon, 29 Jan 2018 07:16:04 +0100

2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode <
unicode_at_unicode.org>:

> On Sun, 28 Jan 2018 20:29:28 +0100
> Philippe Verdy via Unicode <unicode_at_unicode.org> wrote:
>
> > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
> > unicode_at_unicode.org>:
> >
> > > On Sat, 27 Jan 2018 14:13:40 -0800The theory
> > > of regular expressions (though you may not think that mathematical
> > > regular expressions matter) extends to trace monoids, with the
> > > disturbing exception that the Kleene star of a regular language is
> > > not necessarily regular. (The prototypical example is sequences
> > > (xy)^n where x and y are distinct and commute, i.e. xy and yx are
> > > canonically equivalent in Unicode terms. A Unicode example is the
> > > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II -
> > > there is no FSM that will recognise canonically equivalent strings).
> > >
> >
> > I don't see why you can't write this as the regular expression:
> > (x | y)*
>
> Because xx does not match.
>
> In principle, it can be done iteratively thus:
>
> 1) Look for sequences of x's and y's - your (x | y) *
> 2) Discard matches from (1) where the number of x's and y's are equal.
>
> However, the second step cannot be implemented by a *finite* state
> machine.
>
> > For the Unicode canonical equivalences, this applies to distinct
> > characters that have distinct non-zero combining classes.
>
> Those of us who've looked at optimising collation by reducing
> normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in
> theory, a source of many problems.
>
> > But of course searching for
> >
> > <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or
> > <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >
> >
> > requires transforming it to NFD first as:
>
> That wasn't I had in mind. What I had in mind was accepting the
> propositions that the string <LATIN SMALL LETTER A, COMBINING DOT
> BELOW, COMBINING CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH
> CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW.
>
> > <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
> >
> > so thet the regexp transforms to:
> >
> > <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] -
> > [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> *
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT
> > BELOW> | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] -
> > BELOW> [[:cc=above:][:cc=below:]]
> > ]] * < COMBINING CIRCUMFLEX>
>
> If everything is converted to NFD, the regular expressions using traces
> can be converted to frequently unintelligible regexes on strings; the
> behaviour of the converted regex when faced with an unnormalised string
> is of course irrelevant.
>
> In the search you have in mind, the converted regex for use with NFD
> strings is actually intelligible and simple:
>
> <LATIN SMALL LETTER A>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> <COMBINING DOT BELOW>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> <COMBINING CIRCUMFLEX>
>
> Informal notation can simplify the regex still further.
>
> There is no upper bound to the length of a string matching that regex,
>

Wrong, you've not read what followed immediately that commented it already:
it IS bound exactly because you cannot duplicate the same combining class,
and there's a known finite number of them for acceptable cases: if there's
any repetition, it will always be within that bound. But it not necessay to
expand all the combinations of combining classes to all their possible
ordering of occurence (something that a classic regexp normally requires by
expecting a specific order).

One way to solve it would have to have (generic) regexp extension allowing
to specify a combination of one or more of several items in a choice list
in any order, but never more than one occurence of each of item. This is
possible using a rule with boolean flags of presence, one boolean for each
item in the choice list.

Something like {?a|b|c|d} matching zero or more (or all of them) of a,b,c,d
(these can be subregexps) in any order, and {?+a|b|c|d} matching one or
more, and {?{m,n}a|b|c|d} matching betwen m and n of them (in any order in
all cases)
So that {?a|b|c|d}{1,1} is the same as (a|b|c|d) but without the capture,
i.e. (?:a|b|c|d), and {?{m,n}a} is the same as a{m,n}, and {?+a} is the
same as a, and {?*a} is the same as a?

Which can also be written respectrively as {?*[abcd]}, {?+[abcd]}
and {?{m,n}[abcd])
if the items of the choice list are characters that can be compacted with
the classic "character class" notation [abcd].

In all these the "{?quantifier list}" notation is always bound by the
number of items in the list (independantly of the quantifier, and if
individual items in the list are bound in length, the whole will be bound
by the sum of their lengths. So even if the quantifier is higher than than
the number of items in the list, it will be capped: "{?{1000}a}" will only
match "a", and "{?{1000}}" will never match anything (because the list is
empty: the specified higher bound 1000 is capped to 0 but the specified
lower bound 1000 is capped to 1 and this is impossible) and is also
equivalent to {?} where the min-max specified bounds are 1 by default, but
capped to 1,0
Received on Mon Jan 29 2018 - 00:16:44 CST

This archive was generated by hypermail 2.2.0 : Mon Jan 29 2018 - 00:16:45 CST