L2/01-053

From: Markus Scherer [markus.scherer@jtcsv.com]
Sent: Tuesday, January 23, 2001 7:53 PM
Subject: Agenda item: case mappings should not cross planes

Hello,

This just came up, and I suggest adding it to the agenda of next week's UTC meeting, if possible:


For simple case mappings of Unicode UTF-16 strings using just the 1:1 mappings in UnicodeData.txt, it is desirable that all BMP code points have mappings to other BMP code points, and all supplementary code points have mappings to other supplementary code points.

Proposal: State in the Unicode Standard that the simple case mappings in UnicodeData.txt will never map a BMP code point to a supplementary one, nor a supplementary code point to one in the BMP.


This matters for functions like the common (but non-ANSI) C library functions strupr() and strlwr(). They modify a code unit buffer in place, assuming that the length of the string does not change.
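
To illustrate the kind of function I mean, here is a minimal sketch of an in-place uppercasing routine for a UTF-16 buffer. All names are hypothetical, and toy_simple_upper() is only a stand-in for a real UnicodeData.txt lookup: it assumes just ASCII a-z and the Deseret small letters U+10428..U+1044F. The routine returns -1 as soon as a mapping would change the number of code units, which is exactly the case the proposal is meant to rule out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a UnicodeData.txt simple-uppercase lookup.
   Assumption: covers only ASCII a-z and Deseret small letters. */
static uint32_t toy_simple_upper(uint32_t c) {
    if (c >= 0x61 && c <= 0x7A) return c - 0x20;        /* a-z -> A-Z */
    if (c >= 0x10428 && c <= 0x1044F) return c - 0x28;  /* Deseret small -> capital */
    return c;
}

/* Number of UTF-16 code units needed for one code point. */
static size_t utf16_len(uint32_t c) {
    return c < 0x10000 ? 1 : 2;
}

/* Uppercase n UTF-16 code units in place.
   Returns 0 on success, -1 if a mapping would change the string length. */
static int toy_strupr_utf16(uint16_t *s, size_t n) {
    size_t i = 0;
    assert(s != NULL || n == 0);
    while (i < n) {
        uint32_t c = s[i];
        size_t len = 1;
        /* Combine a well-formed surrogate pair into one code point. */
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[i + 1] - 0xDC00);
            len = 2;
        }
        uint32_t u = toy_simple_upper(c);
        if (utf16_len(u) != len) {
            return -1;  /* mapping crosses the BMP boundary */
        }
        if (len == 1) {
            s[i] = (uint16_t)u;
        } else {
            u -= 0x10000;
            s[i] = (uint16_t)(0xD800 + (u >> 10));
            s[i + 1] = (uint16_t)(0xDC00 + (u & 0x3FF));
        }
        i += len;
    }
    return 0;
}
```

With the proposed guarantee in the standard, the -1 branch could never be taken for simple mappings, and the function would not need an error return at all.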


With just the simple UnicodeData.txt mappings, this is no problem for UTF-32 implementations.

For UTF-16, keeping the string length unchanged means that case mappings must not cross the BMP boundary.

I am not sure whether the same holds for UTF-8; I have not checked UnicodeData.txt for case mappings that cross the two additional boundaries at U+0080 and U+0800, where the UTF-8 length of a code point changes.
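
A check along the following lines could be run over each (code point, simple mapping) pair from UnicodeData.txt to answer that question for both encodings. This is only a sketch with hypothetical function names; reading and parsing the file is left out. As far as I can tell, a pair such as U+0131 (dotless i) with its simple uppercase U+0049 would be flagged for UTF-8, since it goes from two bytes to one, while it is harmless for UTF-16.

```c
#include <assert.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed for one code point. */
static int units_utf8(uint32_t c) {
    assert(c <= 0x10FFFF);
    if (c < 0x80) return 1;
    if (c < 0x800) return 2;
    if (c < 0x10000) return 3;
    return 4;
}

/* Number of UTF-16 code units needed for one code point. */
static int units_utf16(uint32_t c) {
    assert(c <= 0x10FFFF);
    return c < 0x10000 ? 1 : 2;
}

/* Nonzero if mapping 'from' to 'to' keeps the UTF-8 length unchanged. */
static int same_length_utf8(uint32_t from, uint32_t to) {
    return units_utf8(from) == units_utf8(to);
}

/* Nonzero if mapping 'from' to 'to' keeps the UTF-16 length unchanged,
   i.e. the mapping does not cross the BMP boundary. */
static int same_length_utf16(uint32_t from, uint32_t to) {
    return units_utf16(from) == units_utf16(to);
}
```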


This discussion does not apply to full I18N case mappings with SpecialCasing.txt because APIs for those must handle growing and shrinking strings anyway.

I am currently working on C APIs for ICU, and I am planning simple, in-place string case mapping functions as well as full-fledged, everything-SpecialCasing ones with an input and an output buffer.
Until now, we only had a C++ implementation using our string class, where this is not an API issue, but even there our implementation makes this assumption.

Thanks for your consideration,
markus