Date: April 6, 2006
From: Mark Davis
Subject: Bounded Reordering Format
I propose that we make the following addition to UAX #15. (This is a rough draft, and would be refined in discussion.)
* Add the following text, probably in an appendix.
There are certain environments where people would like to make use of normalization, but where there are implementation constraints such as in buffered serialization. Consider the extreme case of a string containing a 2 followed by 10,000 umlauts followed by one dot-below, then a 3. In the process of normalization, the dot-below at the end must be reordered to immediately after the 2, which means that 10,003 characters need to be considered as a whole before the result can be output.
While such text is not illegal, it is not practically meaningful; yet the possibility of encountering it forces a conformant, serializing implementation to provide large buffer capacity, or provide a special exception mechanism just for such degenerate cases. To address this situation, the following mechanism is specified:
D5. A string is said to be in Bounded Reordering Format if it does not contain any sequences of non-starters longer than 30 characters in length.
D6. The Bounded Reordering Process is defined as the process of producing a string in Bounded Reordering Format by processing the text from start to finish, inserting a CGJ character within long sequences of non-starters. The exact position of the inserted CGJs are determined according to the following logical process, which describes the generation of an output string from an input string.
Implementations can optimize the above specification as long as they produce the same results. In particular, the information used in Step 3 can be precomputed: it does not require the actual normalization of the character. In Unicode 5.0, for example, it consists of the following:
|Count||Initial non-starters||Starters||Trailing non-starters||Example|
||LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON|
||GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI|
||COMBINING GRAVE ACCENT|
||COMBINING GREEK DIALYTIKA TONOS|
With any normal text, the Bounded Reordering Process will not modify the text at all. Where modification does result, it is no longer canonically equivalent to the original, but the modifications are minor and don't disturb any meaningful content. The modified text contains all of the content of the original, with the only difference being that reordering is blocked across long groups non-starters. Any text in Bounded Reordering Format can then be normalized with very small buffers using any of the standard Normalization Forms.
This logical process can be implemented in the same implementation pass as normalization for efficient processing. In such case, the choice of whether to do the Bounded Reordering Process or not can be controlled by an input parameter.
* Add a conformance clause, so that people can refer to it.
UAX15-C4. A process that purports to transform text according to the Bounded Reordering Process must do so in accordance with the specifications in this document.