From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 06 2005 - 15:35:11 CDT
From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
> If we think that the characters deserve "characterhood" in Unicode, the
> natural step would be to define names for them, as defined in UAX #34,
> "Unicode Named Character Sequences",
> ( http://www.unicode.org/reports/tr34/ )
> I was actually somewhat surprised at seeing that the list of currently
> defined named character sequences does not contain any Cyrillic letters
> with diacritic mark. Maybe the idea has not become popular.
I do think that named character sequences can really help fix how
"missing" letters with diacritics should be composed. This new list needs to
become more popular and better referenced in the standard; it will help
interoperability, as it complements very well the additional canonical
compositions found in the main UCD file.
Listing a named character sequence there allows implementations to treat the
composed sequence as a single entity that should not be broken in most
processes.
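
To illustrate what "not broken" could mean in practice, here is a minimal
sketch in Python that keeps a base character together with the combining
marks that follow it. The function name combining_sequences is only an
illustration, and this is a simplification (it is not the full segmentation
algorithm of UAX #29); it only assumes that any character with a general
category starting with "M" attaches to the preceding character.

```python
import unicodedata


def combining_sequences(text):
    """Split text into units made of one base character plus its marks.

    Simplified sketch: a character whose general category starts with "M"
    (Mn, Mc, Me) is attached to the unit before it.
    """
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch   # keep the mark with its base character
        else:
            units.append(ch)  # start a new unit
    return units


# CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT stays in one piece.
print(combining_sequences("\u0430\u0301\u0431"))
```

A process that truncates, wraps or reverses text could then work on these
units instead of on isolated code points.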
Maybe this list should also be of interest for UCA collation. I am not
sure that the default collation table is consistent with this list, and
maybe the DUCET should map entries for those composed sequences. (I consider
that UCA collation is heavily linked to the concept of characters as
perceived by users, and as recognized in the standard with Unicode Named
Character Sequences.)
Suppose for example that LATIN SMALL LETTER A WITH TILDE were not encoded in
Unicode; then we would have to encode LATIN SMALL LETTER A followed by a
COMBINING TILDE, and consider that in most processing the pair will be used
as a single entity. So the DUCET should map the named sequence "LATIN SMALL
LETTER A WITH TILDE" to a single entry.
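
As a small check of the premise, the following Python sketch compares the
existing precomposed letter with the two-character sequence that would
replace it in this thought experiment. The helper name
canonically_equivalent is hypothetical; it simply compares the NFD forms
returned by the standard unicodedata module.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00E3"  # LATIN SMALL LETTER A WITH TILDE
sequence = "a\u0303"    # LATIN SMALL LETTER A + COMBINING TILDE

print(canonically_equivalent(precomposed, sequence))  # True
```

If the precomposed letter did not exist, only the two-character form would
remain, and a named sequence would be the only way to give it a name.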
However, the DUCET is designed to map only isolated characters, not combining
sequences. This may create difficulties when sorting the other combining
sequence, LATIN SMALL LETTER A WITH ACUTE AND TILDE, because in the
normalized form the COMBINING ACUTE ACCENT would occur in the encoded
combining sequence before the COMBINING TILDE. The solution to this problem
would then be to add another named sequence for the 3 characters.
So a good question is:
Can a "Unicode Named Character Sequence" be recognized as a single entity
when there are other combining characters in the middle of the sequence, and
when moving those extra combining characters to the end of the named sequence
is still canonically equivalent? My opinion is that such a named sequence
should still be recognized (due to the canonical equivalence), to help
interoperability.
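
The condition in this question can be sketched in Python as follows. This is
only an illustration under stated assumptions, not an algorithm defined by
UAX #34: the function name leftover_marks is hypothetical, the input is taken
to be a single combining sequence, and the matching is a simple greedy scan.
The idea is to pull the named sequence out of the text, move the remaining
marks to the end, and accept the match only if the result is canonically
equivalent to the original text.

```python
import unicodedata


def nfd(s):
    return unicodedata.normalize("NFD", s)


def leftover_marks(text, named):
    """Return the extra marks if `text` is canonically equivalent to
    `named` followed by those marks; return None otherwise."""
    text_n, named_n = nfd(text), nfd(named)
    extra = []
    i = 0
    for ch in text_n:
        if i < len(named_n) and ch == named_n[i]:
            i += 1            # this character belongs to the named sequence
        else:
            extra.append(ch)  # this character interrupts or follows it
    if i != len(named_n):
        return None           # the named sequence is not fully present
    if any(unicodedata.combining(ch) == 0 for ch in extra):
        return None           # only reorderable combining marks may be extra
    moved = named_n + "".join(extra)
    return "".join(extra) if nfd(moved) == text_n else None


A_TILDE = "a\u0303"  # stand-in for the hypothetical named sequence

# COMBINING DOT BELOW (class 220) sorts before COMBINING TILDE (class 230)
# in normalized text, yet moving it to the end is canonically equivalent.
print(leftover_marks("a\u0323\u0303", A_TILDE) == "\u0323")  # True

# COMBINING ACUTE ACCENT and COMBINING TILDE share class 230, so their
# order is significant and moving the acute is not equivalent.
print(leftover_marks("a\u0301\u0303", A_TILDE) is None)      # True
```

Under this reading, a mark with a different combining class can interrupt the
named sequence without hiding it, while the acute-and-tilde case is not
covered by canonical equivalence and would need its own 3-character named
sequence.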