Re: case conversion -> longer UTF8?

From: Mark Davis (marked@best.com)
Date: Sun Apr 04 1999 - 14:45:18 EDT


Yes, the size of a string can change under case conversion, for several
reasons.

1. The case conversion causes an expansion. For example:

0149; 0149; 02BC 004E; 02BC 004E; # LATIN SMALL LETTER N PRECEDED BY
APOSTROPHE

1F80; 1F80; 1F88; 1F08 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND
YPOGEGRAMMENI

2. The case mapping crosses a UTF-8 size boundary. These boundaries are
at 7F, 7FF, FFFF (the largest code points that fit in 1, 2, and 3 bytes).
Example:

0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I (in Turkish)

3. You can also get shrinkage in UTF-8, because of boundary crossing!
Example:

017F;LATIN SMALL LETTER LONG S;Ll;0;L;<compat> 0073;;;;N;;;0053;;0053

4. By chance, 00DF (sharp s, "eszett") uppercases to 0053 0053: it
expands because of character expansion, and contracts (in UTF-8) because
of boundary crossing, thus ending up with the same number of bytes!

5. Look at these files for examples; future versions of the standard may
add other examples as well.

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
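
The cases in the list above can be checked with a short Python sketch.
It is only an illustration under a few assumptions: the helper names
(utf8_len, utf8_size, upper_delta, size_changing_code_points) are made
up for this example, and Python's str.upper() applies the unconditional
full case mappings of whatever Unicode version the interpreter ships
with, not the 1999 data files linked above. str.lower() is not
locale-sensitive, so the Turkish mapping is compared by code point
rather than by calling it.

```python
def utf8_len(cp: int) -> int:
    """Bytes needed for one code point, using the boundaries 7F, 7FF, FFFF."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def utf8_size(s: str) -> int:
    """Total UTF-8 size of a string, summed per code point."""
    return sum(utf8_len(ord(ch)) for ch in s)


def upper_delta(s: str) -> int:
    """Change in UTF-8 size when s is uppercased with str.upper()."""
    return utf8_size(s.upper()) - utf8_size(s)


def size_changing_code_points(limit: int = 0x10000):
    """Yield (code point, delta) where uppercasing changes the UTF-8 size."""
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates, they are not characters
            continue
        delta = upper_delta(chr(cp))
        if delta:
            yield cp, delta


# Item 1: character expansion (2 bytes become 3, 3 bytes become 5).
print(upper_delta("\u0149"), upper_delta("\u1f80"))   # 1 2
# Item 2: Turkish I -> dotless i crosses the 7F boundary.
print(utf8_len(0x0049), utf8_len(0x0131))             # 1 2
# Item 3: long s -> S shrinks across the same boundary.
print(upper_delta("\u017f"))                          # -1
# Item 4: sharp s -> SS keeps the same byte count.
print(upper_delta("\u00df"))                          # 0
```

Iterating over size_changing_code_points() lists every code point below
the given limit whose uppercase form differs in UTF-8 size, at least
according to the interpreter's own Unicode tables.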

BTW, there were some production problems with SpecialCasing.txt that
resulted in some bad mappings. This will be corrected soon.

Mark

Hallvard B Furuseth wrote:

> Can a UTF-8 string ever become longer when it's converted to upper- or
> lowercase?
>
> --
> Hallvard

--
business: mark.davis@us.ibm.com, mark@unicode.org
personal: mark@macchiato.com, http://www.macchiato.com
--


