Source: Mark Davis
Date: July 18, 2006
Subject: Text fragments for UAX#15
Here are some fragments of text that didn't make the final draft of UAX#15, but are grist for a future version.
Using buffers for normalization does require that characters be emptied from the buffer correctly. That is, as decompositions are appended to the buffer, periodically the end of the buffer will be reached. At that time, the characters in the buffer up to but not including the last character with the property value Quick_Check=Yes (QC=Y) must be canonically ordered (and if NFC and NFKC are being generated, must also be composed), and only then flushed.
Consider the following example. Text is being normalized into NFC with a buffer size of 40. The buffer has been successively filled with decompositions and has two remaining slots. The next decomposition takes three characters, so it does not fit. The last character with QC=Y is the "s", shown below.
  Slot: 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39
  Char: T h e c ◌́ a   ...  p  ◌̃  q  r  ◌́  s  ◌́
  Next decomposition:
  Slot: 0 1 2
  Char: u ◌̃ ◌́
Thus the buffer up to but not including "s" needs to be composed, and flushed. Once this is done, the decomposition can be appended, and the buffer is left in the following state:
  Slot: 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39
  Char: s ◌́ u ◌̃ ◌́
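This flush logic can be sketched in Python with the standard unicodedata module. The stdlib does not expose the Quick_Check property directly, so the quick_check_yes predicate below is a simplified stand-in (a starter that is unchanged under NFC), an assumption for illustration rather than the real property lookup:

```python
import unicodedata

def quick_check_yes(ch):
    # Stand-in for the real NFC_Quick_Check=Yes test (not exposed by
    # the Python stdlib): treat a starter that is unchanged under NFC
    # as safe. A production implementation would use the actual
    # property data from the UCD.
    return unicodedata.combining(ch) == 0 and unicodedata.normalize("NFC", ch) == ch

def flush(buffer):
    # Find the last character with QC=Y; everything before it can no
    # longer interact with later input, so that prefix can be composed
    # and emitted, leaving the rest in the buffer.
    for i in range(len(buffer) - 1, -1, -1):
        if quick_check_yes(buffer[i]):
            break
    else:
        return "", buffer  # no safe point: nothing can be flushed yet
    flushed = unicodedata.normalize("NFC", "".join(buffer[:i]))
    return flushed, buffer[i:]

# Mirroring the example above: the prefix up to (but not including)
# the last QC=Y character "s" is composed and emitted; "s" plus its
# trailing combining mark stay in the buffer.
out, rest = flush(list("The ca\u0301r\u0301s\u0301"))
```

Composing the flushed prefix and the remainder separately gives the same result as normalizing the whole string at once, which is exactly why the boundary is safe.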
Implementations may also canonically order (and compose) the contents of the buffer as they go; the key constraint is that a sequence cannot be composed until a following character with the property QC=Y has been encountered. For example, if that had been done in the example above, then during the course of filling the buffer we would have had the following state, where "c" is the last character with QC=Y.
  Slot: 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39
  Char: T h e c ◌́
When the "a" (with QC=Y) is to be appended to the buffer, it is then safe to compose the "c" and all subsequent characters, and then enter in the "a", marking it as the last character with QC=Y.
  Slot: 0 1 2 3 4 5 6 ... 31 32 33 34 35 36 37 38 39
  Char: T h e ć a
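The compose-as-you-go variant can be sketched as a generator that emits a composed segment each time a new safe character arrives. The safe predicate here is again a simplified stand-in for QC=Y (the Python stdlib does not expose the real Quick_Check property), an assumption for this sketch:

```python
import unicodedata

def stream_nfc(chars):
    # Incremental NFC sketch: compose and emit the pending segment
    # each time a new safe character arrives, since later input can
    # no longer interact with anything before that boundary.
    def safe(ch):
        # Simplified stand-in for QC=Y: a starter unchanged under NFC.
        return unicodedata.combining(ch) == 0 and unicodedata.normalize("NFC", ch) == ch

    pending = ""
    for ch in chars:
        if pending and safe(ch):
            yield unicodedata.normalize("NFC", pending)
            pending = ch
        else:
            pending += ch
    if pending:
        yield unicodedata.normalize("NFC", pending)
```

Concatenating the emitted segments reproduces the NFC of the whole input, e.g. `"".join(stream_nfc("The c\u0301a"))` equals `unicodedata.normalize("NFC", "The c\u0301a")`.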
For more information on the Quick_Check property, see Section 14, Detecting Normalization Forms.
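As an aside, Python (3.8 and later) exposes a detection test of this kind as unicodedata.is_normalized, which reports whether a string is already in a given normalization form, often without constructing the normalized copy:

```python
import unicodedata

# A string with precomposed é is already NFC; the same text spelled
# with a combining acute accent is not.
print(unicodedata.is_normalized("NFC", "caf\u00E9"))   # True
print(unicodedata.is_normalized("NFC", "cafe\u0301"))  # False
```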
Implementations can optimize the above specification as long as they produce the same results. In particular, the information used in Step 3 of D6 can be precomputed: it does not require actually normalizing the character. For Unicode 5.0, for example, the precomputed numbers of initial and trailing non-starters are shown in Table 12.
Table 12. Unicode 5.0 Precomputed Data
Row | Total     | Starter? | Initial Non-Starters | Contains Starter? | Trailing Non-Starters | Example
----+-----------+----------+----------------------+-------------------+-----------------------+--------
 A  | 1,112,646 | Y        | 0                    | Y                 | 0                     | all others
 B  | 793       | Y        | 0                    | Y                 | 1                     | DIAERESIS
 C  | 248       | Y        | 0                    | Y                 | 2                     | LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
 D  | 36        | Y        | 0                    | Y                 | 3                     | GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
 E  | 2         | Y        | n/a                  | N                 | 1                     | HALFWIDTH KATAKANA VOICED SOUND MARK
 F  | 3         | Y        | n/a                  | N                 | 2                     | TIBETAN VOWEL SIGN II
 G  | 383       | N        | n/a                  | N                 | 1                     | COMBINING GRAVE ACCENT
 H  | 1         | N        | n/a                  | N                 | 2                     | COMBINING GREEK DIALYTIKA TONOS
Rows E-H contain all the characters whose NFKD form does not contain a starter in Unicode 5.0.
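The per-character data in Table 12 can be recomputed for any Unicode version from the NFKD decomposition and the canonical combining classes; a sketch in Python (note that when the decomposition contains no starter, the initial and trailing counts coincide, where the table records the count once and marks the other column n/a):

```python
import unicodedata

def nonstarter_profile(ch):
    # Compute the Table 12 data for one character from its NFKD:
    # (initial non-starters, contains a starter?, trailing non-starters).
    # A non-starter is a character with a nonzero canonical combining class.
    ccs = [unicodedata.combining(c) for c in unicodedata.normalize("NFKD", ch)]
    initial = 0
    while initial < len(ccs) and ccs[initial] != 0:
        initial += 1
    contains_starter = initial < len(ccs)  # a ccc == 0 character remains
    trailing = 0
    while trailing < len(ccs) and ccs[-1 - trailing] != 0:
        trailing += 1
    return initial, contains_starter, trailing
```

For example, U+01D5 (row C) yields (0, True, 2), U+1F82 (row D) yields (0, True, 3), and U+0344 (row H) yields (2, False, 2).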
From this data it is clear which shortest sequences, once decomposed, approach the defined limit of 30 non-starters in a row: a character such as U+1F82, which expands into a starter plus three trailing non-starters, followed by a run of characters such as U+0344, each of which expands into two non-starters. Thirteen such characters contribute 26 non-starters, for a run of 29; thus only a sequence containing more than 13 characters from rows E-H can generate more than 30 non-starters in a row and force normalization to be invoked in Step 3 of D6. Because a sequence of more than 13 E-H characters will be exceedingly rare, an implementation of the Stream-Safe Text Process for Unicode 5.0 using this optimization will almost never need to do any actual normalization.
Note: This particular limit of 13 E-H characters is valid for Unicode 5.0 and all prior versions of the standard; it may change in future versions, and certainly the set of characters that qualify as E-H characters will expand in future versions. Thus even though the definition of the Stream-Safe Text Process is stable, the data for this particular optimization is not: it needs to be recalculated for each new version of Unicode.
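The arithmetic behind the limit of 13 can be checked directly by counting the longest run of non-starters in the NFKD of a candidate sequence (the decompositions of U+1F82 and U+0344 are covered by the normalization stability guarantees, so these counts also hold in later Unicode versions):

```python
import unicodedata

def max_nonstarter_run(s):
    # Length of the longest run of non-starters (ccc != 0) in the NFKD of s.
    run = best = 0
    for ch in unicodedata.normalize("NFKD", s):
        if unicodedata.combining(ch):
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

# U+1F82 contributes 3 trailing non-starters; each U+0344 adds 2 more.
print(max_nonstarter_run("\u1F82" + "\u0344" * 13))  # 29: within the limit of 30
print(max_nonstarter_run("\u1F82" + "\u0344" * 14))  # 31: exceeds the limit
```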