Re: Cyrillic - accented/acuted vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 06 2005 - 15:35:11 CDT

Next message: Philippe Verdy: "Re: Announcement of Changes to the Unicode Membership structure"

Previous message: Rick McGowan: "Announcement of Changes to the Unicode Membership structure"
In reply to: Jukka K. Korpela: "Re: Cyrillic - accented/acuted vowels"
Next in thread: Peter Kirk: "Re: Cyrillic - accented/acuted vowels"
Reply: Peter Kirk: "Re: Cyrillic - accented/acuted vowels"

From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
> If we think that the characters deserve "characterhood" in Unicode, the
> natural step would be to define names for them, as defined in UAX #34,
> "Unicode Named Character Sequences",
> ( http://www.unicode.org/reports/tr34/ )
> I was actually somewhat surprised at seeing that the list of currently
> defined named character sequences does not contain any Cyrillic letters
> with diacritic mark. Maybe the idea has not become popular.

I do think that named character sequences can really help settle how
"missing" letters with diacritics should be composed. This new list needs to
become more popular and be better referenced in the standard. It will help
interoperability, as it complements very well the additional canonical
compositions found in the main UCD file.
Listing a named sequence there allows implementations to treat the
composed sequence as a single entity that should not be broken in most
processes.

Maybe this list should also be of interest for UCA collation. I am not
sure that the default collation table is consistent with this list, and
maybe the DUCET should contain entries for those composed sequences. (I consider
UCA collation to be closely linked to the concept of characters as
perceived by users, and as recognized in the standard through Unicode Named
Character Sequences.)

Suppose for example that LATIN SMALL LETTER A WITH TILDE were not encoded in
Unicode. Then we would have to encode LATIN SMALL LETTER A followed by a
COMBINING TILDE, and consider that in most processing the pair will be used
as a single entity. So the DUCET should map the named sequence "LATIN SMALL
LETTER A WITH TILDE" to a single entry.
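
To make the hypothetical concrete, here is a minimal Python sketch (standard
unicodedata module only; the variable names are just for illustration). It
shows that the precomposed letter and the pair "base letter + COMBINING
TILDE" are canonically equivalent today, which is the kind of pair a process
would have to keep together if the precomposed letter did not exist:

```python
import unicodedata

precomposed = "\u00E3"   # LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"   # LATIN SMALL LETTER A + COMBINING TILDE

# NFD splits the precomposed letter into the base letter and the mark.
nfd = unicodedata.normalize("NFD", precomposed)
# NFC recombines the pair into the single precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)

same_after_nfd = (nfd == decomposed)
same_after_nfc = (nfc == precomposed)
print(same_after_nfd, same_after_nfc)
```

A named sequence has no such precomposed code point to fall back on, so the
grouping would have to come from the list of named sequences itself.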

However, the DUCET is designed to map only isolated characters, not combining
sequences. This may create difficulties when sorting another combining
sequence such as LATIN SMALL LETTER A WITH ACUTE AND TILDE, because in the
normalized form the COMBINING ACUTE ACCENT would occur in the encoded
combining sequence before the COMBINING TILDE, separating the tilde from its
base letter. The solution to this problem would then be to add another named
sequence for the 3 characters.
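
How normalization treats the two marks can be checked with a short Python
sketch (standard unicodedata module; the names ACUTE, TILDE, acute_first and
tilde_first are my own). As far as I can tell, both marks have the same
canonical combining class, so normalization keeps them in the order in which
they were entered rather than reordering them, and the two orders are not
canonically equivalent:

```python
import unicodedata

ACUTE = "\u0301"   # COMBINING ACUTE ACCENT
TILDE = "\u0303"   # COMBINING TILDE

# Canonical combining classes decide whether marks may be reordered.
ccc_acute = unicodedata.combining(ACUTE)
ccc_tilde = unicodedata.combining(TILDE)

# Normalize both possible orders of the two marks on the base letter.
acute_first = unicodedata.normalize("NFD", "a" + ACUTE + TILDE)
tilde_first = unicodedata.normalize("NFD", "a" + TILDE + ACUTE)

print(ccc_acute, ccc_tilde)
print(acute_first == tilde_first)
```

So a table entry for the pair "a + tilde" alone would not be reached when the
acute accent sits between the base letter and the tilde.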

So a good question is:
Can a "Unicode Named Character Sequence" be recognized as a single entity
when there are other combining characters in the middle of the sequence, and
when moving those extra combining characters to the end of the named sequence
is still canonically equivalent? My opinion is that such a named sequence
should still be recognized (because of the canonical equivalence), to help
interoperability.
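
The recognition rule I have in mind could be sketched as follows in Python.
This is only an illustration under stated assumptions: the function name
matches_named_sequence is hypothetical, the named sequence is assumed to
consist of one base character followed by combining marks, and the Cyrillic
pair used in the example is not claimed to be in the current list of named
sequences. The extra marks are moved to the end, and the sequence is accepted
only if the result is canonically equivalent to the original text:

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def matches_named_sequence(text, named):
    """True if `text` starts with `named`, possibly interrupted by extra
    marks whose move to the end keeps canonical equivalence."""
    seq = nfd(text)
    target = nfd(named)
    if not seq or not target or seq[0] != target[0]:
        return False
    # Keep only the first combining sequence: the base plus its marks.
    end = 1
    while end < len(seq) and unicodedata.combining(seq[end]) != 0:
        end += 1
    seq = seq[:end]
    # Match the marks of the named sequence in order; set the others aside.
    wanted = target[1:]
    extras = []
    matched = 0
    for mark in seq[1:]:
        if matched < len(wanted) and mark == wanted[matched]:
            matched += 1
        else:
            extras.append(mark)
    if matched != len(wanted):
        return False
    # Moving the extras to the end must not change the normalized form.
    return nfd(target + "".join(extras)) == seq

named = "\u0430\u0301"   # CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT
with_dot_below = matches_named_sequence("\u0430\u0323\u0301", named)
with_tilde = matches_named_sequence("\u0430\u0303\u0301", named)
print(with_dot_below, with_tilde)
```

With a COMBINING DOT BELOW in the middle the sequence is still recognized,
because that mark has a different combining class and can be moved. With a
COMBINING TILDE in the middle it is not, because the tilde and the acute
share a combining class and their order is significant.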

This archive was generated by hypermail 2.1.5 : Fri May 06 2005 - 15:36:18 CDT