Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

From: Ilya Zakharevich
Date: Wed, 23 Apr 2014 00:35:02 -0700

On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote:
> if you read UAX#9, the way the algorithm works is by pushing openers
> on a stack, then, on finding the first closer, going down the stack
> and attempting to locate a match, then, on finding a match,
> discarding any enclosed openers, on not finding a match, discarding
> the closer.

I think I LOVE this definition. Simple, beautiful, and IMO following
people’s expectations very closely.
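
To make the quoted description concrete, here is a minimal sketch of it in
Python. It is only my reading of the paragraph above, not the exact UAX#9
rule: the delimiter table PAIRS and the name match_brackets are made up for
illustration, and details of the real algorithm (the Unicode bracket data,
canonical equivalents, any stack limit) are ignored.

```python
# Hypothetical delimiter table; the real UBA takes its pairs from Unicode data.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def match_brackets(text):
    """Return (pairs, unmatched) as sorted lists of string indices."""
    stack = []      # indices of openers still waiting for a closer
    pairs = []      # (open_index, close_index) of matched pairs
    unmatched = []  # indices of delimiters marked as non-matching
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append(i)
        elif ch in CLOSERS:
            # Go down the stack looking for the nearest matching opener.
            for depth in range(len(stack) - 1, -1, -1):
                if PAIRS[text[stack[depth]]] == ch:
                    pairs.append((stack[depth], i))
                    # Openers above the match are enclosed: discard them.
                    unmatched.extend(stack[depth + 1:])
                    del stack[depth:]
                    break
            else:
                unmatched.append(i)  # no match found: discard the closer
    unmatched.extend(stack)          # openers that never found a closer
    return sorted(pairs), sorted(unmatched)


print(match_brackets("a(b[c)d]"))  # ([(1, 5)], [3, 7])
```

In "a(b[c)d]" the ")" matches the "(", the enclosed "[" is discarded, and the
final "]" then finds nothing left to match.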

Here is what “theoretizing” gives:

 a parsing is good if it satisfies all conditions below:

   0) Some delimiters in the string are marked as “non-matching”; the rest
      are partitioned into disjoint “matched” pairs;

   MATCH) A “matched” pair consists of an open-delimiter and a matching
          close-delimiter (in this order in the string).

   NEST) “Matched” pairs are properly nested (meaning that 2 pairs cannot be
         positioned as Open1 Open2 Close1 Close2 in the string order).

   MINLEN) “Inside” a “matched” pair, every delimiter which could match elements
           of the pair but is marked as “non-matching” must nest inside
           some deeper-nested “matched” pair.

(I hope that the meaning of the word “inside” in MINLEN is clear.)

   GREED) Given any close-delimiter marked as “non-matching”, its
          pre-context does not contain any open-delimiter which could
          match it.

     Here pre-context of a position is a concatenation of substrings of the
     initial string:
     • Take the most deeply nested “matched pair” containing the position
       (if none, the whole string);
     • take the part of the string inside this pair AND before the position;
     • remove all “matched” pairs completely contained inside this substring
       together with what they enclose.
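
The conditions above can be turned into a checker. What follows is a sketch
under my own assumptions, not a reference implementation: a parsing is given
as a list of (open_index, close_index) pairs, every other delimiter counts as
“non-matching”, the table PAIRS and all function names are hypothetical, and
“could match elements of the pair” in MINLEN is read as “is a delimiter of the
same kind as the pair's opener or closer”.

```python
# Hypothetical delimiter table, for illustration only.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}


def pre_context(text, pairs, pos):
    """Pre-context of position pos, as a list of (index, char)."""
    # Most deeply nested matched pair containing pos (else the whole string);
    # with proper nesting, the innermost pair has the largest opener index.
    enclosing = [(o, c) for o, c in pairs if o < pos < c]
    start, _ = max(enclosing, default=(-1, len(text)))
    # Drop matched pairs lying wholly before pos, with what they enclose.
    removed = set()
    for o, c in pairs:
        if start < o and c < pos:
            removed.update(range(o, c + 1))
    return [(i, text[i]) for i in range(start + 1, pos) if i not in removed]


def is_good(text, pairs):
    matched = [i for pair in pairs for i in pair]
    if len(matched) != len(set(matched)):        # 0) pairs must be disjoint
        return False
    for o, c in pairs:
        # MATCH: an opener followed by its own closer.
        if not (o < c and PAIRS.get(text[o]) == text[c]):
            return False
        # NEST: no pair positioned as Open1 Open2 Close1 Close2.
        if any(o < o2 < c < c2 for o2, c2 in pairs):
            return False
    unmatched = [i for i, ch in enumerate(text)
                 if (ch in PAIRS or ch in CLOSERS) and i not in matched]
    for o, c in pairs:
        # MINLEN: a same-kind non-matching delimiter inside the pair must
        # sit inside some deeper matched pair.
        for p in unmatched:
            if o < p < c and text[p] in (text[o], text[c]):
                if not any(o < o2 < p < c2 < c for o2, c2 in pairs):
                    return False
    # GREED: no non-matching closer has a usable opener in its pre-context.
    for p in unmatched:
        if text[p] in CLOSERS:
            wanted = CLOSERS[text[p]]
            if any(ch == wanted for _, ch in pre_context(text, pairs, p)):
                return False
    return True


print(is_good("a(b[c)d]", [(1, 5)]))  # True
print(is_good("a(b[c)d]", []))        # False: GREED fails for ")"
print(is_good("(()", [(0, 2)]))       # False: MINLEN fails for the inner "("
```

For "a(b[c)d]" with the pair (1, 5), the pre-context of the final "]" is "ad":
the matched pair and everything it encloses have been removed, so the "[" is
no longer available to it.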


P.S. Judging by another message of yours, for you “theoretizing” is a
      4-letter word… Oh well…
Unicode mailing list
Received on Wed Apr 23 2014 - 02:36:38 CDT
