L2/06-110R

Date: April 6, 2006
From: Mark Davis
Subject: Bounded Reordering Format


I propose that we make the following addition to UAX #15. (This is a rough draft, and would be refined in discussion.)

* Add the following text, probably in an appendix.

XX Bounded Reordering Format

There are certain environments where people would like to make use of normalization, but where there are implementation constraints such as in buffered serialization. Consider the extreme case of a string containing a 2 followed by 10,000 umlauts followed by one dot-below, then a 3. In the process of normalization, the dot-below at the end must be reordered to immediately after the 2, which means that 10,003 characters need to be considered as a whole before the result can be output.

While such text is not illegal, it is not practically meaningful; yet the possibility of encountering it forces a conformant, serializing implementation to provide large buffer capacity, or provide a special exception mechanism just for such degenerate cases. To address this situation, the following mechanism is specified:

D5. A string is said to be in Bounded Reordering Format if it does not contain any sequences of non-starters longer than 30 characters in length.

D6. The Bounded Reordering Process is defined as the process of producing a string in Bounded Reordering Format by processing the text from start to finish, inserting a CGJ character within long sequences of non-starters. The exact position of the inserted CGJs are determined according to the following logical process, which describes the generation of an output string from an input string.

  1. If the input string is empty, return.
  2. Set nonStarterCount to zero.
  3. For each code point C in the input string:
    1. Produce the NFKD decomposition S.
    2. If nonStarterCount plus the number of initial non-starters in S is greater than 30, append a CGJ to the output string, and set the counter to zero.
    3. Append C to the output string.
    4. If there are no starters in S, add the number of code points in S to nonStarterCount.
    5. Otherwise, set nonStarterCount to the number of trailing non-starters in S (which may be zero).
  4. Return the output string.

Implementations can optimize the above specification as long as they produce the same results. In particular, the information used in Step 3 can be precomputed: it does not require the actual normalization of the character. In Unicode 5.0, for example, it consists of the following:

Count Initial non-starters Starters Trailing non-starters Example
1,112,646 0 1 0 - all others
793 0 1 1 00A8 DIAERESIS
248 0 1 2 01D5 LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
36 0 1 3 1F82 GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
385 1 0 0 0300 COMBINING GRAVE ACCENT
4 2 0 0 0344 COMBINING GREEK DIALYTIKA TONOS

With any normal text, the Bounded Reordering Process will not modify the text at all. Where modification does result, it is no longer canonically equivalent to the original, but the modifications are minor and don't disturb any meaningful content. The modified text contains all of the content of the original, with the only difference being that reordering is blocked across long groups non-starters. Any text in Bounded Reordering Format can then be normalized with very small buffers using any of the standard Normalization Forms.

This logical process can be implemented in the same implementation pass as normalization for efficient processing. In such case, the choice of whether to do the Bounded Reordering Process or not can be controlled by an input parameter.

* Add a conformance clause, so that people can refer to it.

UAX15-C4. A process that purports to transform text according to the Bounded Reordering Process must do so in accordance with the specifications in this document.