Re: Internationalised Computer Science Exercises

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Sun, 28 Jan 2018 20:30:44 +0100

Typo, the full regexp has undesired asterisks:

  <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
  ( <COMBINING CIRCUMFLEX> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]
* <COMBINING DOT BELOW>
  | <COMBINING DOT BELOW> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]
* < COMBINING CIRCUMFLEX>

2018-01-28 20:29 GMT+01:00 Philippe Verdy <verdy_p_at_wanadoo.fr>:

>
>
> 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
> unicode_at_unicode.org>:
>
>> On Sat, 27 Jan 2018 14:13:40 -0800The theory
>> of regular expressions (though you may not think that mathematical
>> regular expressions matter) extends to trace monoids, with the
>> disturbing exception that the Kleene star of a regular language is not
>> necessarily regular. (The prototypical example is sequences (xy)^n
>> where x and y are distinct and commute, i.e. xy and yx are canonically
>> equivalent in Unicode terms. A Unicode example is the set of strings
>> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that
>> will recognise canonically equivalent strings).
>>
>
> I don't see why you can't write this as the regular expression:
> (x | y)*
> For the Unicode canonical equivalences, this applies to distinct
> characters that have distinct non-zero combining classes.
>
> But of course searching for
>
> <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or
> <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >
>
> requires transforming it to NFD first as:
>
> <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
>
> so thet the regexp transforms to:
>
> <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> ( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] -
> [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW>
> | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] -
> [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX>
>
> Note that the "complex" set of characters used three times above is
> finite, it contains all combining characters of Unicode that have a
> non-zero combining class except above and below, i.e. all Unicode
> characters whose combining class is not 0, 220 (below) or 230 (above).
>
> However, It is too simplified, because the allowed combining classes must
> occur at most once in each possible non-zero combining class and not
> arbitrary numbers of them: these allowed combining classs currently are in
> {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218,
> 222, 224, 226, 228, 232..234, 240} whose most member elements are used for
> very few combining characters (the above and below combining classes are
> the most populated ones but we exclude them, all the others have 1 to 9
> combining characters assigned to them, or 22 characters with cc=7 (nukta),
> or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama).
>
> Once again we can refine them also as a regexp, but this is combinatorial
> because we have 52 combining classes (so we would need to consider the 52!
> (factorial) alternates). But given the maximum length of what this can
> match (0 to 52 combining characters: yes it is finite), this is best done
> by not rewriting this as a regexp but by replacing the final "*" by {1,52},
> and then just check each returned match of
>
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52}
>
> with a simple scan of these short strings to see that they all have
> distinct combining classes (this just requires 52 booleans, easily stored
> in a single 64 bit integer initialized to 0 prior to scan the scan of these
> small strings). But the theory does not prevent writing it as a regexp
> (even if it would be extremely long). So a Kleene Star closure is
> possible and can be efficiently implemented (all depends on the way you
> represent the "current state" in the FSM: a single integer representing a
> single node number in the traversal graph is not the best way to do that.
>
> This is a valid regexp, the finite state machine DOES have a finite
> lookahead (the full regexp above will match AT MOST 107 characters
> including the combining marks, where 107=3+2*52), but this is general to
> regexps that generally cannot be transformed directly into a FSM with
> finite lookahead, but a FSM is possible: the regexp first transforms into a
> simple graph of transitions with a finite number of node (this number is
> bound to the length of the regexp itself) where there can be multiple
> states active simultaneously; then a basic transform converts this small
> graph by transforming nodes into new nodes representing the finite set of
> the combinations of active states in the first graph :
>
> There will be many more nodes, and generally this explodes in size because
> the transform is combinatorial, and such size explosion has worst
> perfomance (explosion of the memory needed to representing the new graph
> with a single state active). So regexp engines use the alternative by
> representing the current state of traversal of the first simple graph using
> a stack of active states and transiting them all separately (this can be
> implemented with a "bitset" whose size in bits is the number of states in
> the first simple graph, or by using an associative array (dictionnary of
> boolean properties whose keys are state numbers in the first graph, which
> can be set or removed: this requires much less memory and it is relatively
> fast, even if the full current state is not just a single small integer.
>
>
Received on Sun Jan 28 2018 - 13:32:52 CST

This archive was generated by hypermail 2.2.0 : Sun Jan 28 2018 - 13:32:53 CST