L2/06-103
Source: Mark Davis
Date: 2006-04-03
Subject: Stability limits for buffer overruns
==================

Currently we have a strong, but unstated, policy against encoding new precomposed forms. What we actually have as a formal policy, however, is a slightly weaker one that just prevents changes in the normalization of existing characters.

In practice, it is pretty useless to encode new precomposed forms, since they will decompose in NFC, which is our recommended form for general text interchange, but we don't absolutely forbid it. It would also be possible, under our current policy, to add a new precomposed form to NFC *if* both the precomposed character *and* at least one of its components are new in a given release (since that also wouldn't disturb existing normalizations) -- however, given the remaining scripts we have to encode, in practice we really don't need that option.

There is a reason to tighten our policies further, formally: people are very sensitive to buffer overflow issues. Right now, no character when mapped to NFC expands by more than 3X (in bytes, in any encoding form: UTF-8, UTF-16, UTF-32). Cases that now hit that limit are:

  Encoding form     Character
  UTF-8             U+1D160 (𝅘𝅥𝅮)
  UTF-16, UTF-32    U+FB2C (שּׁ)

If we were ever to add a new precomposed form that expanded by more than 3X, it would break that expectation; so formally, people now have to plan for an arbitrary increase in size in the future.

We have a similar issue in case-folding, which is used for case-insensitive comparison. The current greatest expansion is also 3X (an example being U+0390 (ΐ)).

I recommend that the UTC petition the officers to add additional stability policies:

A. When any string is converted to NFC, it will never expand in bytes by more than 3 times, whether it is in UTF-8, UTF-16, or UTF-32.

   * In practice, only in rare cases will strings expand at all in NFC, but programs must be cognizant of the possibility of expansion to avoid buffer overruns.

B. When any string is converted to a case-folded form, it will never expand in bytes by more than 3 times, whether it is in UTF-8, UTF-16, or UTF-32.

   * In practice, only in rare cases will strings expand at all in case-folding, but programs must be cognizant of the possibility of expansion to avoid buffer overruns.
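
The expansion factors cited above can be checked with a short script. The following is a minimal sketch in Python, not part of the proposal: the helper names (expansion_factors, nfc, max_output_bytes) are hypothetical, the standard unicodedata module supplies NFC, and str.casefold is assumed here as a stand-in for full case folding. The buffer-sizing helper simply assumes that the proposed 3X limit holds.

```python
import unicodedata

# Encoding forms named in the proposal, mapped to Python codec names.
# The "-le" variants are used so that no byte order mark is counted.
ENCODINGS = {"UTF-8": "utf-8", "UTF-16": "utf-16-le", "UTF-32": "utf-32-le"}


def expansion_factors(text, transform):
    """Return {encoding form: bytes after transform / bytes before}."""
    result = transform(text)
    return {
        name: len(result.encode(codec)) / len(text.encode(codec))
        for name, codec in ENCODINGS.items()
    }


def nfc(text):
    """Normalize text to NFC using the Unicode data bundled with Python."""
    return unicodedata.normalize("NFC", text)


def max_output_bytes(input_bytes, limit=3):
    """Worst-case output size, ASSUMING the proposed 3X limit is guaranteed."""
    return input_bytes * limit


if __name__ == "__main__":
    # Characters cited in the text as hitting the current 3X limit.
    print("U+1D160 NFC:     ", expansion_factors("\U0001D160", nfc))
    print("U+FB2C  NFC:     ", expansion_factors("\uFB2C", nfc))
    print("U+0390  casefold:", expansion_factors("\u0390", str.casefold))
    # Buffer needed for a 100-byte input under the assumed limit.
    print("buffer for 100 bytes:", max_output_bytes(100))
```

Each of the three characters should report a factor of 3.0 in at least one encoding form, matching the cases listed above. Without a formal limit, a caller cannot rely on max_output_bytes at all and must either measure the result first or grow the buffer dynamically.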