L2/06-110
Date: April 6, 2006
From: Mark Davis
Subject: Bounded Reordering Format
=====

I propose that we make the following addition to UAX #15. (This is a rough draft, and would be refined in discussion.)

* Add the following text, probably in an appendix.

XX Bounded Reordering Format

There are certain environments where people would like to make use of normalization, but where there are implementation constraints, such as in buffered serialization. Consider the example of a string containing an 'a' followed by 10,000 umlauts followed by one dot below. In the process of normalization, the dot below at the end must be reordered to immediately after the 'a', which means that 10,002 characters need to be considered as a whole before the result can be output.

While such text is not illegal, it is not practically meaningful; yet the possibility of encountering it forces a conformant, serializing implementation to provide large buffer capacity, or to provide a special exception mechanism just for such degenerate cases. To address this situation, the following mechanism is specified:

D5. A string is said to be in *Bounded Reordering Format* if it does not contain any sequence of non-starters longer than 30 characters.

* Such a string is guaranteed to never contain any sequence of non-starters longer than 30. The value of 30 is chosen to be significantly beyond what is required for any linguistic or technical usage. While it would have been feasible to choose a smaller number, this value provides a very wide margin.

D6. The *Bounded Reordering Process* is defined as the process of producing a string in Bounded Reordering Format by processing the text from start to finish, inserting a CGJ character between any sequence of 25 non-starters and a following non-starter.

* The reason for inserting the CGJ characters every 25 characters is to allow for any possible expansion of combining characters in a possible subsequent normalization.

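
As an illustration only, D5 and D6 might be sketched in Python roughly as follows. This is a reading of the draft definitions, not a reference implementation. It assumes that "non-starter" means a character whose canonical combining class is non-zero, as reported by the standard `unicodedata.combining` function, and that CGJ is U+034F COMBINING GRAPHEME JOINER. The function names and the `limit` and `group` parameters are hypothetical.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER; its combining class is 0


def is_non_starter(ch):
    # Assumption: a non-starter is any character with a non-zero
    # canonical combining class.
    return unicodedata.combining(ch) != 0


def is_bounded_reordering_format(text, limit=30):
    """D5 (sketch): no run of non-starters longer than `limit`."""
    run = 0
    for ch in text:
        run = run + 1 if is_non_starter(ch) else 0
        if run > limit:
            return False
    return True


def bounded_reordering_process(text, group=25):
    """D6 (sketch): insert CGJ between any run of `group` non-starters
    and a following non-starter, scanning from start to finish."""
    out = []
    run = 0
    for ch in text:
        if is_non_starter(ch):
            if run == group:
                # `group` non-starters already emitted and another follows.
                out.append(CGJ)
                run = 0
            run += 1
        else:
            run = 0
        out.append(ch)
    return "".join(out)
```

Applied to the example above ('a', 10,000 umlauts, one dot below), this sketch inserts 400 CGJ characters and leaves every run of non-starters at 25 or fewer. Text without long runs of non-starters passes through unchanged.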
With any normal text, the Bounded Reordering Process will not modify the text at all. Where modification does result, the output is no longer canonically equivalent to the original, but the modifications are minor and do not disturb any meaningful content. The modified text contains all of the content of the original, with the only difference being that reordering is blocked across groups of 25 non-starters.

Any text in Bounded Reordering Format can then be normalized with very small buffers using any of the standard Normalization Forms.

This logical process can be implemented in the same implementation pass as normalization for efficient processing. In that case, the choice of whether to do the Bounded Reordering Process or not can be controlled by an input parameter.

* Add a conformance clause, so that people can refer to it.

UAX15-C4. A process that purports to transform text according to the Bounded Reordering Process must do so in accordance with the specifications in this document.

* For more information, see Section XX
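
To illustrate the claim that text in Bounded Reordering Format can be normalized with very small buffers, here is a rough Python sketch of a streaming NFD pass. It is an assumption-laden illustration, not part of the proposal: it covers only NFD, it uses the standard `unicodedata` module to decompose one character at a time, and the `stream_nfd` name and `max_buffer` parameter are hypothetical. The only state held back is the current run of non-starters, which is sorted by combining class and flushed whenever a starter arrives.

```python
import unicodedata


def stream_nfd(chars, max_buffer=None):
    """Yield the NFD form of `chars` one code point at a time (sketch).

    Only the pending run of non-starters is buffered. If `max_buffer`
    is given, exceeding it raises ValueError instead of growing the
    buffer without limit.
    """
    pending = []  # non-starters seen since the last starter
    for ch in chars:
        # Full canonical decomposition of a single character.
        for cp in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(cp) == 0:
                # A starter ends the run: reorder it and flush.
                pending.sort(key=unicodedata.combining)  # stable sort
                yield from pending
                pending.clear()
                yield cp
            else:
                pending.append(cp)
                if max_buffer is not None and len(pending) > max_buffer:
                    raise ValueError("non-starter run exceeds buffer")
    pending.sort(key=unicodedata.combining)
    yield from pending
```

With input that has already been through the Bounded Reordering Process, a `max_buffer` of 30 would not be exceeded as long as decomposition does not expand a run past that size. The 'a' plus 10,000 umlauts example would raise the error unless CGJ characters had been inserted first.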