Yes, the size of a string can change under case conversion, for several reasons.
1. The case conversion causes an expansion. For example:
0149; 0149; 02BC 006E; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
1F80; 1F80; 1F88; 1F00 03B9; # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
2. The case mapping crosses a UTF-8 size boundary. These boundaries are
at 7F, 7FF, FFFF: code points up to each one take 1, 2 and 3 bytes
respectively.
Example:
0049; 0049; 0131; 0131; tr; # LATIN CAPITAL LETTER I (in Turkish)
3. You can also get shrinkage in UTF-8, because of boundary crossing!
Example:
017F;LATIN SMALL LETTER LONG S;Ll;0;L;<compat> 0073;;;;N;;;0053;;0053
4. By chance, 00DF (es-zed) happens to expand because of character
expansion, and contract (in UTF-8) because of boundary crossing, thus
ending up with the same number of bytes!
5. Look at these files for examples; future versions of the standard may
add other examples as well.
ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
BTW, there were some production problems with SpecialCasing.txt that
resulted in some bad mappings. This will be corrected soon.
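For anyone who wants to check the byte counts in items 1 to 4, here is a
minimal Python sketch. It assumes Python 3, whose str.upper() and
str.lower() apply the unconditional full case mappings from its bundled
Unicode data but no locale tailoring, so the Turkish case is shown by
comparing the two characters directly. The helper names utf8_len and
size_change are made up for illustration.

```python
def utf8_len(s):
    """Return the number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def size_change(s, convert):
    """Return (bytes before, bytes after) applying convert to s."""
    return utf8_len(s), utf8_len(convert(s))


# (label, input, conversion) for the cases discussed above.
samples = [
    ("1. expansion: U+0149 uppercased", "\u0149", str.upper),
    ("1. expansion: U+1F80 uppercased", "\u1F80", str.upper),
    ("3. shrinkage: U+017F long s uppercased", "\u017F", str.upper),
    ("4. no net change: U+00DF uppercased", "\u00DF", str.upper),
]

for label, text, convert in samples:
    before, after = size_change(text, convert)
    print(f"{label}: {before} -> {after} bytes")

# Item 2 (Turkish I -> dotless i) needs a locale-aware mapping, which
# str.lower() does not provide, so compare the two characters directly.
print("2. boundary crossing: U+0049 vs U+0131:",
      utf8_len("\u0049"), "->", utf8_len("\u0131"), "bytes")
```

A real Turkish-aware conversion would need a library with locale
tailoring; the last line only compares the two encoded lengths.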
Mark
Hallvard B Furuseth wrote:
> Can a UTF-8 string ever become longer when it's converted to upper- or
> lowercase?
>
> --
> Hallvard
-- business: mark.davis@us.ibm.com, mark@unicode.org personal: mark@macchiato.com, http://www.macchiato.com --