Fixing Two Unicode Asymmetries in case conversion

From: Marco Mussini (marco.mussini@vim.tlt.alcatel.it)
Date: Thu Nov 12 1998 - 07:09:38 EST


Hi there,

please consider these two cases:

(A) Turkish letter "dotless I"

What the Turkish language requires:

Turkish has the letter "i" in both a dotted and a dotless form, and each
form has its own lowercase and uppercase letter.
When converting case, the dotless form must remain dotless and the
dotted form must remain dotted.

The current Unicode situation is:

LATIN CAPITAL LETTER I WITH DOT ABOVE (\u0130)
has the following lowercase equivalent:
LATIN SMALL LETTER I (\u0069)

LATIN SMALL LETTER DOTLESS I (\u0131)
has the following uppercase equivalent:
LATIN CAPITAL LETTER I (\u0049)

In addition, Unicode has the usual Western correspondence between:
LATIN SMALL LETTER I (\u0069)
and
LATIN CAPITAL LETTER I (\u0049)

This leads to the following problem when implementing a case conversion
routine:
you have to take the language into account to process the letters "i"
and "I" correctly; the routine needs one behaviour for converting the
case of "i"/"I" in Turkish and another for all other languages.
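To make the two behaviours concrete, here is a minimal sketch in Python
(chosen only for illustration). The helper names to_upper and to_lower
and the "turkish" flag are hypothetical, not part of any library; the
sketch assumes the caller already knows the language of the text.

```python
# Hypothetical helpers: a language flag selects the Turkish "i" rules.
# Pre-translate the four "i" letters, then let the default mapping
# handle every other character.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def to_upper(text, turkish=False):
    """Uppercase text; with turkish=True, dotted i stays dotted."""
    if turkish:
        text = text.translate(_TR_UPPER)
    return text.upper()


def to_lower(text, turkish=False):
    """Lowercase text; with turkish=True, dotless I stays dotless."""
    if turkish:
        text = text.translate(_TR_LOWER)
    return text.lower()


if __name__ == "__main__":
    word = "diyarbak\u0131r"
    print(to_upper(word))                # default rules: the dot is lost
    print(to_upper(word, turkish=True))  # Turkish rules: the dot is kept
```

The same input gives different results depending on the flag, which is
exactly the language dependency described above.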

This problem could be fixed by adopting the following suggestion in the
Unicode standard:

- introduce a dedicated codepoint for the capital dotless letter "I"
coming from LATIN SMALL LETTER DOTLESS I, and give it LATIN SMALL LETTER
DOTLESS I as its lowercase correspondent.

- introduce a dedicated codepoint for the lowercase dotted letter "i"
coming from LATIN CAPITAL LETTER I WITH DOT ABOVE, and give it LATIN
CAPITAL LETTER I WITH DOT ABOVE as its uppercase correspondent.

We would then have a total of 6 codepoints to handle the case of the
letter "i" in both Turkish and non-Turkish languages.

(B) The second problem is about the German sharp S issue:

- there is a codepoint for LATIN SMALL LETTER SHARP S (\u00df) which
does not have any uppercase correspondent defined, but according to
German language rules you have to convert it into "SS" when you go to
upper case. Since there is no dedicated codepoint for the "double S",
you have to (1) handle this special case explicitly in your case
conversion code, and (2) grow the string that contained the Sharp S
character to accommodate the extra character.
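Both points can be seen in a short sketch, again in Python and again
only as an illustration. The function name upper_with_sharp_s is
hypothetical; the explicit branch stands for the special case a
conversion routine has to carry.

```python
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S


def upper_with_sharp_s(text):
    """Uppercase text, expanding each Sharp S into "SS" explicitly."""
    out = []
    for ch in text:
        if ch == SHARP_S:
            out.append("SS")        # (1) explicit special case
        else:
            out.append(ch.upper())
    return "".join(out)             # (2) result may be longer than input


if __name__ == "__main__":
    word = "stra" + SHARP_S + "e"
    upper = upper_with_sharp_s(word)
    print(len(word), len(upper))    # the uppercase string has grown
    print(upper.lower() == word)    # lowercasing does not restore Sharp S
```

Because the output is built as a new string, the growth is hidden here;
a routine that converts in place in a fixed-size buffer would have to
reallocate or shift the remaining characters.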

Both of these disadvantages would be removed if Unicode introduced a
new dedicated codepoint for double S and put it into bidirectional
correspondence with Sharp S.

We would like to have your opinion on this matter.

--A. Borsotti, M. Mussini
Alcatel
Italy
