Re: Cyrillic - accented/acuted vowels

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 06 2005 - 15:35:11 CDT


    From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
    > If we think that the characters deserve "characterhood" in Unicode, the
    > natural step would be to define names for them, as defined in UAX #34,
    > "Unicode Named Character Sequences",
    > ( http://www.unicode.org/reports/tr34/ )
    > I was actually somewhat surprised at seeing that the list of currently
    > defined named character sequences does not contain any Cyrillic letters
    > with diacritic mark. Maybe the idea has not become popular.

    I do think that named character sequences can really help settle how
    "missing" letters with diacritics should be composed. The list needs to
    become more popular and better referenced in the standard; it will help
    interoperability, as it complements very well the canonical compositions
    found in the main UCD file.
    Listing a sequence there allows implementations to treat the composed
    sequence as a single entity that should not be broken in most processes.

    Maybe this list should also be of interest for UCA collation. I am not
    sure that the default collation table is consistent with this list, and
    maybe the DUCET should map entries for those composed sequences. (I consider
    that UCA collation is heavily linked to the concept of characters as
    perceived by users, and as recognized in the standard with Unicode Named
    Character Sequences.)

    Suppose, for example, that LATIN SMALL LETTER A WITH TILDE were not encoded
    in Unicode. We would then have to encode LATIN SMALL LETTER A followed by a
    COMBINING TILDE, and consider that in most processing the pair is used as a
    single entity. So the DUCET should map the named sequence "LATIN SMALL
    LETTER A WITH TILDE" to a single entry.
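
    As a rough illustration of the hypothetical above, here is a small sketch
    using Python's standard unicodedata module (the choice of Python and the
    variable names are assumptions of this example, not part of the thread).
    NFC fuses a + COMBINING TILDE because a precomposed character exists, while
    CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT has no precomposed form
    and remains a two-code-point sequence, which is the kind of case a named
    sequence would cover.

```python
import unicodedata

# LATIN SMALL LETTER A + COMBINING TILDE (a precomposed form exists)
latin = "a\u0303"
# CYRILLIC SMALL LETTER A + COMBINING ACUTE ACCENT (no precomposed form)
cyrillic = "\u0430\u0301"

latin_nfc = unicodedata.normalize("NFC", latin)
cyrillic_nfc = unicodedata.normalize("NFC", cyrillic)

# Show how many code points survive composition, and their names.
for label, text in (("latin", latin_nfc), ("cyrillic", cyrillic_nfc)):
    print(label, len(text), [unicodedata.name(ch) for ch in text])
```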

    However, the DUCET is designed to map only isolated characters, not
    combining sequences. This may create difficulties when sorting the further
    combining sequence LATIN SMALL LETTER A WITH ACUTE AND TILDE, because in
    the normalized form the COMBINING ACUTE ACCENT would occur in the encoded
    combining sequence before the COMBINING TILDE. The solution to this problem
    would then be to add another named sequence for the 3 characters.
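
    Whether normalization actually moves one mark in front of another depends
    on the canonical combining classes of the marks, which can be checked
    directly. The sketch below (Python's unicodedata, chosen for this example)
    prints the classes and the NFD order. As far as the character database
    shipped with Python goes, COMBINING ACUTE ACCENT and COMBINING TILDE share
    class 230, so they keep the order in which they were entered; COMBINING DOT
    BELOW (class 220) is added here, as an assumption of this example, to show
    a mark that does get moved in front of the tilde.

```python
import unicodedata

ACUTE, TILDE, DOT_BELOW = "\u0301", "\u0303", "\u0323"

# Canonical combining class of each mark.
classes = {unicodedata.name(m): unicodedata.combining(m)
           for m in (ACUTE, TILDE, DOT_BELOW)}
print(classes)

# Same class: canonical ordering leaves the marks where they are.
same_class = unicodedata.normalize("NFD", "a" + TILDE + ACUTE)

# Different classes: the lower class (dot below) is moved before the tilde.
diff_class = unicodedata.normalize("NFD", "a" + TILDE + DOT_BELOW)

print([unicodedata.name(ch) for ch in same_class])
print([unicodedata.name(ch) for ch in diff_class])
```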

    So a good question is:
    Can a "Unicode Named Character Sequence" be recognized as a single entity
    when there are other combining characters in the middle of the sequence, and
    when moving those extra combining characters to the end of the named
    sequence is still canonically equivalent? My opinion is that such a named
    sequence should still be recognized (because of the canonical equivalence),
    to help interoperability.
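
    One way to make the question concrete is a minimal sketch of such a test.
    The function name matches_named_sequence is hypothetical, and this is not
    the UCA's or any library's algorithm: it looks for the characters of the
    sequence in order inside a combining sequence, skips any extra characters
    found in between, and accepts the match only if moving those extras to the
    end of the sequence yields a canonically equivalent string.

```python
import unicodedata

def matches_named_sequence(cluster: str, sequence: str) -> bool:
    """Hypothetical helper: True if `cluster` is canonically equivalent to
    `sequence` followed by the extra characters found in `cluster`."""
    c = unicodedata.normalize("NFD", cluster)
    s = unicodedata.normalize("NFD", sequence)
    if not s or not c.startswith(s[0]):
        return False
    skipped = []
    i = 1  # position in the decomposed cluster
    for target in s[1:]:
        found = False
        while i < len(c):
            ch = c[i]
            i += 1
            if ch == target:
                found = True
                break
            skipped.append(ch)  # extra character sitting in the middle
        if not found:
            return False
    # Move the extras to the end and compare under canonical equivalence.
    candidate = s + "".join(skipped) + c[i:]
    return unicodedata.normalize("NFD", candidate) == c

a_tilde = "a\u0303"  # the sequence to recognize
print(matches_named_sequence("a\u0323\u0303", a_tilde))  # dot below in between
print(matches_named_sequence("a\u0301\u0303", a_tilde))  # acute in between
```

    With the character data shipped with Python, the first call accepts the
    match (the dot below can be moved without changing canonical equivalence)
    and the second rejects it, because acute and tilde have the same combining
    class and swapping them produces a different, non-equivalent string.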


