Resolving dotted and dotless "i"

From: John Cowan (cowan@drv.cbc.com)
Date: Wed Sep 10 1997 - 13:21:42 EDT

Next message: Randy Williams: "RE: RE: Is it a font or an encoding?"
Previous message: Hart, Edwin F.: "FW: Is it a font or an encoding?"
Next in thread: Joan Aliprand: "Re: Resolving dotted and dotless "i""
Maybe reply: Joan Aliprand: "Re: Resolving dotted and dotless "i""
Maybe reply: Mark Davis: "Re: Resolving dotted and dotless "i""

Title: Resolving dotted and dotless "i"
Source: John Cowan <cowan@ccil.org>
Primary Author: John Cowan (no affiliation)
Status: Expert contribution
Action: For the consideration of UTC
References: None
Distribution: UTC and elsewhere as appropriate

Summary: This proposal urges UTC to create two new characters
   in the Latin Extended-B block, to be known as LATIN CAPITAL
   LETTER DOTLESS I and LATIN SMALL LETTER I WITH DOT ABOVE.
   Making these characters available extends existing precedents
   to assist in language-independent case-mapping efforts.
   The first of these characters is considered essential; the
   second, being a compatibility character, could be omitted,
   but its presence makes 1-1 roundtrip case-mapping possible.

Note to UTC Secretary: Please acknowledge receipt of this proposal.

The Problem:

Currently it is difficult or impossible without extra context to
correctly case-map the characters U+0069 LATIN SMALL LETTER I and
U+0049 LATIN CAPITAL LETTER I. In most languages, these two characters
are lowercase and uppercase versions of each other, respectively.
Turkish, however, employs different case-mappings: the uppercase
version of LATIN SMALL LETTER I is U+0130 LATIN CAPITAL LETTER I WITH
DOT ABOVE, and the lowercase version of LATIN CAPITAL LETTER I is
U+0131 LATIN SMALL LETTER DOTLESS I.

The extra context that is required is not always available,
and even if it is available or can be guessed at (by detecting the
presence of characteristically Turkish characters or character
sequences), it burdens every user of case-mapping, who must do
additional processing, accept occasionally incorrect results,
or both.
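
The mismatch can be illustrated with a minimal Python sketch. The
helper names and the "turkish" flag are hypothetical; the flag stands
in for the extra context that, as noted above, is not always
available. The default branch relies on Python's built-in,
language-independent case mappings.

```python
# Turkish-specific exceptions to the default mappings (from the text above).
_TR_UPPER = {"i": "\u0130", "\u0131": "I"}   # i -> I WITH DOT ABOVE, dotless i -> I
_TR_LOWER = {"I": "\u0131", "\u0130": "i"}   # I -> dotless i, I WITH DOT ABOVE -> i


def to_upper(text: str, turkish: bool = False) -> str:
    """Uppercase text; 'turkish' is the out-of-band language context."""
    return "".join(
        _TR_UPPER[ch] if turkish and ch in _TR_UPPER else ch.upper()
        for ch in text
    )


def to_lower(text: str, turkish: bool = False) -> str:
    """Lowercase text; 'turkish' is the out-of-band language context."""
    return "".join(
        _TR_LOWER[ch] if turkish and ch in _TR_LOWER else ch.lower()
        for ch in text
    )


# The same input yields different results depending on the flag.
print(to_upper("istanbul"))                 # default mapping
print(to_upper("istanbul", turkish=True))   # Turkish mapping
```

Every caller has to supply the flag correctly; when it is wrong or
missing, the result is silently incorrect for one of the two groups
of languages.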

There are other cases of language-specific case-mappings, the
best known of which is the convention that French accented
vowels lose the accent when capitalized. This is not true in
Canadian French, and I am told that it is an artifact of
French typewriters and is in the process of being discarded
even in France. The Turkish case-mapping, however, is deeply
embedded in Turkish orthography and is not likely to change.

The Proposed Solution:

Introduce two new characters, LATIN CAPITAL LETTER DOTLESS I
and LATIN SMALL LETTER I WITH DOT ABOVE. The first of these
has the same glyphic representation as LATIN CAPITAL LETTER I,
and the second has the same glyphic representation as LATIN
SMALL LETTER I. In addition, LATIN SMALL LETTER I WITH DOT
ABOVE is a compatibility character, canonically equivalent to
LATIN SMALL LETTER I followed by U+0307 COMBINING DOT ABOVE.

Precedents:

There is precedent in Unicode for introducing clones of capital
letters that have distinct lower-case forms. In particular,
U+00D0 LATIN CAPITAL LETTER ETH, U+0110 LATIN CAPITAL LETTER D
WITH STROKE, and U+0189 LATIN CAPITAL LETTER AFRICAN D all have
the same glyphic form. They are kept distinct, however, because
of their lowercase forms (respectively U+00F0, U+0111, U+0256),
all of which have distinct glyphs.

Conversely, U+0259 LATIN SMALL LETTER SCHWA and U+01DD LATIN
SMALL LETTER TURNED E are glyphically identical, but are kept
distinct as characters because of their glyphically distinct
uppercase forms U+018F and U+018E.
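
The pairings cited above can be checked with a short Python sketch.
It relies on Python's unicodedata module and built-in case mappings,
which reflect whatever Unicode version ships with the interpreter
rather than version 2.0; the helper name is_case_pair is hypothetical.

```python
import unicodedata

# (capital, small) pairs cited in the text above.
LOOKALIKE_CAPITALS = [("\u00D0", "\u00F0"), ("\u0110", "\u0111"), ("\u0189", "\u0256")]
LOOKALIKE_SMALLS = [("\u018F", "\u0259"), ("\u018E", "\u01DD")]


def is_case_pair(capital: str, small: str) -> bool:
    """True if the two characters map to each other in both directions."""
    return capital.lower() == small and small.upper() == capital


for capital, small in LOOKALIKE_CAPITALS + LOOKALIKE_SMALLS:
    print(
        f"U+{ord(capital):04X} {unicodedata.name(capital)} <-> "
        f"U+{ord(small):04X} {unicodedata.name(small)}: "
        f"{is_case_pair(capital, small)}"
    )
```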

In addition, there are many Unicode non-letter characters that
are glyphically identical but have been given distinct codepoints
for semantic reasons, typically where a loosely defined ASCII
character has been given multiple equivalents, each with
precise semantics.

Combining Characters:

The Unicode Standard (version 2.0, p. 6-7) prescribes that LATIN
SMALL LETTER I followed by any top diacritic loses its native dot;
consequently, LATIN SMALL LETTER I followed by COMBINING DOT ABOVE
looks exactly like LATIN SMALL LETTER I with no diacritic. When
mapped to uppercase, however, the difference is visible.

Unicode already specifies that the sequence of LATIN CAPITAL LETTER I
followed by COMBINING DOT ABOVE is canonically equivalent to
LATIN CAPITAL LETTER I WITH DOT ABOVE. Introducing the new
character LATIN SMALL LETTER I WITH DOT ABOVE simply provides a
correct 1-1 roundtrip case-mapping for LATIN CAPITAL LETTER I WITH
DOT ABOVE.
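
As a rough illustration of the asymmetry, the following Python sketch
composes both sequences. It assumes Python's unicodedata module and
normalization form NFC, which belong to later Unicode versions than
the one cited here; the variable names are illustrative only.

```python
import unicodedata


def nfc(text: str) -> str:
    """Compose a string into its canonical precomposed form where one exists."""
    return unicodedata.normalize("NFC", text)


capital_sequence = "I\u0307"   # LATIN CAPITAL LETTER I + COMBINING DOT ABOVE
small_sequence = "i\u0307"     # LATIN SMALL LETTER I + COMBINING DOT ABOVE

# The capital sequence has a precomposed equivalent (U+0130).
print([f"U+{ord(c):04X}" for c in nfc(capital_sequence)])

# The small sequence has no precomposed counterpart, so it stays as two
# characters; that gap is what the proposed new character would fill.
print([f"U+{ord(c):04X}" for c in nfc(small_sequence)])
```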

If the proposed LATIN SMALL LETTER I WITH DOT ABOVE is rejected as
introducing an unnecessary compatibility character, then the same effect
can be produced by representing it in the decomposed form and
case-mapping each character individually.
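
A minimal sketch of that fallback, assuming Python's unicodedata
module (normalization forms NFD and NFC postdate Unicode 2.0) and a
hypothetical helper name map_case_decomposed:

```python
import unicodedata


def map_case_decomposed(text: str, upper: bool) -> str:
    """Decompose, case-map each character on its own, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(ch.upper() if upper else ch.lower() for ch in decomposed)
    return unicodedata.normalize("NFC", mapped)


dotted_capital = "\u0130"                                   # I WITH DOT ABOVE
lowered = map_case_decomposed(dotted_capital, upper=False)  # i + COMBINING DOT ABOVE
restored = map_case_decomposed(lowered, upper=True)

print([f"U+{ord(c):04X}" for c in lowered])
print(restored == dotted_capital)
```

The dot survives the trip through lowercase as a separate combining
character, so no information is lost even without a precomposed
small letter.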

Character Properties:

The entries in the Character Properties database for these new
characters would look something like this:

0218;LATIN CAPITAL LETTER DOTLESS I;Lu;0;L;;;;;N;;;;0131;
0219;LATIN SMALL LETTER I WITH DOT ABOVE;Ll;0;L;;;;;N;;;0130;;0130

In addition, the lowercase mapping for U+0130 would be changed
from 0069 to 0219, and the uppercase and titlecase mappings for
U+0131 would be changed from 0049 to 0218.
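
The two records above can be read with a short Python sketch. It
assumes the 15-field, semicolon-separated layout of the properties
file, in which the last three fields hold the uppercase, lowercase,
and titlecase mappings. The names CaseRecord and parse_record are
hypothetical, and 0218 and 0219 are the code points proposed here,
not assignments from the published standard.

```python
from typing import NamedTuple, Optional


class CaseRecord(NamedTuple):
    code: int
    name: str
    category: str
    upper: Optional[int]
    lower: Optional[int]
    title: Optional[int]


def _hex_or_none(field: str) -> Optional[int]:
    """Parse a hexadecimal field; an empty field means 'no mapping'."""
    return int(field, 16) if field else None


def parse_record(line: str) -> CaseRecord:
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return CaseRecord(
        code=int(fields[0], 16),
        name=fields[1],
        category=fields[2],
        upper=_hex_or_none(fields[12]),
        lower=_hex_or_none(fields[13]),
        title=_hex_or_none(fields[14]),
    )


# The proposed entries, copied from the text above.
PROPOSED = [
    "0218;LATIN CAPITAL LETTER DOTLESS I;Lu;0;L;;;;;N;;;;0131;",
    "0219;LATIN SMALL LETTER I WITH DOT ABOVE;Ll;0;L;;;;;N;;;0130;;0130",
]
records = {rec.code: rec for rec in map(parse_record, PROPOSED)}

for rec in records.values():
    print(f"U+{rec.code:04X} {rec.name}: upper={rec.upper} lower={rec.lower}")
```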

If the proposed U+0219 is rejected, then at any rate the lowercase
mapping for U+0130 should be set to null rather than the incorrect
0069, which loses information. Correct lowercasing can then be
achieved by decomposition.

Transcoding Issues:

The chief objection to this proposal would be that transcoding
from 8859-9 to Unicode would become more difficult. Incoming
4/9 and 6/9 characters would be mapped to U+0218 and U+0219
only in Turkish text, whereas non-Turkish text would require the
default mappings to U+0049 and U+0069. This would undoubtedly
be an additional burden.

(If the character U+0219 is rejected as unnecessary, then 6/9
would be transcoded under this proposal to U+0069 U+0307.)

On the other hand, this means that the language-interpretation
effort needs to be done only once, when the text is imported
into the Unicode world, rather than every time a case-mapping
must be performed. It also means that the burden is placed only
on those who use Turkish text, rather than on every user.
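
A sketch of such a one-time import step, in Python. It uses the
standard "iso8859_9" codec for the default mapping; the function
name, the "turkish" flag, and the "have_u0219" switch are
hypothetical, and U+0218 and U+0219 are the code points proposed in
this message, not assignments from the published standard.

```python
# Proposed code points (hypothetical; taken from this proposal only).
CAPITAL_DOTLESS_I = "\u0218"
SMALL_I_WITH_DOT = "\u0219"


def transcode_8859_9(data: bytes, turkish: bool, have_u0219: bool = True) -> str:
    """Transcode 8859-9 bytes, remapping 4/9 and 6/9 only for Turkish text."""
    text = data.decode("iso8859_9")   # default mapping: 4/9 -> U+0049, 6/9 -> U+0069
    if not turkish:
        return text
    table = {
        ord("I"): CAPITAL_DOTLESS_I,
        # Without U+0219, fall back to the decomposed form U+0069 U+0307.
        ord("i"): SMALL_I_WITH_DOT if have_u0219 else "i\u0307",
    }
    return text.translate(table)


sample = b"Iii"
print([f"U+{ord(c):04X}" for c in transcode_8859_9(sample, turkish=False)])
print([f"U+{ord(c):04X}" for c in transcode_8859_9(sample, turkish=True)])
```

The language decision is made once, at the call site that imports
the text; later case-mapping code needs no flag at all.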

Undoubtedly some systems would take a simplified approach to
transcoding, assuming that all 8859-9 text is Turkish and using
U+0218 and U+0219 exclusively. This would produce occasional
bad results generally similar to those achieved when LATIN CAPITAL
LETTER A is substituted for CYRILLIC CAPITAL LETTER A or the
like. Nevertheless, the results would still be interpretable.

###

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
			e'osai ko sarji la lojban


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:36 EDT