Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Asmus Freytag
Date: Thu Jul 15 2004 - 04:32:29 CDT

    Nobody doubts that some text exists with multiple accents on vowels. Where
    the vowels are not Latin a, o, u, there is no issue at all, since German
    sorting makes no distinction for them. Where the vowels are a, o, u, as in
    the Livonian example you cited, getting the correct sorting behavior is a
    matter of designing the collation table.
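
    To make the table-design point concrete, here is a minimal sketch in
    Python. It is a toy primary-level sort key, not UCA and not an actual DIN
    table. The rule that umlaut sorts as vowel + "e" while tréma sorts as the
    bare vowel is an assumption made for illustration, and the names
    build_table, TABLE and sort_key are hypothetical.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def build_table():
    """Hypothetical tailoring for a, o, u only (not a real DIN or UCA table)."""
    table = {}
    for vowel in "aou":
        table[vowel + CGJ + DIAERESIS] = vowel   # tréma: sort as the bare vowel
        table[vowel + DIAERESIS] = vowel + "e"   # umlaut: assumed ae/oe/ue rule
    return table

TABLE = build_table()

def sort_key(text, table=TABLE):
    """Toy primary-level key: apply the tailoring, ignore all other marks."""
    text = unicodedata.normalize("NFD", text.lower())
    out = []
    i = 0
    while i < len(text):
        for length in (3, 2):  # try the longer (CGJ) sequence first
            chunk = text[i:i + length]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            ch = text[i]
            # Drop CGJ and combining marks that the tailoring does not mention.
            if ch != CGJ and unicodedata.combining(ch) == 0:
                out.append(ch)
            i += 1
    return "".join(out)

names = ["Muller", "M\u00FCller", "Sau\u034F\u0308l", "Saul"]
for name in sorted(names, key=sort_key):
    print(name, "->", sort_key(name))
```

    In this sketch a vowel other than a, o, u simply loses its diaeresis in
    the key whether or not CGJ is present, which matches the case where
    sorting makes no distinction.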

    If there is anything in UCA that would make it impossible to design correct
    collation tables for German university libraries when CGJ is used with
    tréma but not with umlaut, then you have an issue. At the moment, I see
    lots of speculation and red herrings (Greek and Coptic, indeed!) but no
    smoking gun.
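
    One property such tables depend on can be checked directly: CGJ has
    canonical combining class 0, so normalization neither removes it nor
    composes across it. The following is a small check using Python's
    standard unicodedata module. Using the plain sequence for umlaut and the
    CGJ sequence for tréma is the convention under discussion here; the
    variable names are only illustrative.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

umlaut = "a" + DIAERESIS       # <BASE, DIAERESIS>
trema = "a" + CGJ + DIAERESIS  # <BASE, CGJ, DIAERESIS>

umlaut_nfc = unicodedata.normalize("NFC", umlaut)
trema_nfc = unicodedata.normalize("NFC", trema)
trema_nfd = unicodedata.normalize("NFD", trema)

def code_points(s):
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(c):04X}" for c in s]

print("umlaut NFC:", code_points(umlaut_nfc))
print("trema  NFC:", code_points(trema_nfc))
print("trema  NFD:", code_points(trema_nfd))
print("CGJ combining class:", unicodedata.combining(CGJ))
```

    The plain sequence composes to the precomposed letter under NFC, while
    the CGJ sequence comes through NFC and NFD unchanged, so the two stay
    distinguishable as input to a collation table.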

    And yes, the incidence of Livonian data (relative to tréma, which is itself
    rather uncommon relative to umlaut) may be below the threshold at which
    support short of the theoretical optimum becomes a practical concern. That
    decision belongs to the German bibliographers.


    At 02:13 AM 7/15/2004, Peter Kirk wrote:
    >On 15/07/2004 05:00, Asmus Freytag wrote:
    >>At 01:52 PM 7/14/2004, Doug Ewell wrote:
    >>>It's not German data (with umlauts) that will be affected by this
    >>>solution, but non-German data (with diaereses) in German bibliographic
    >>>systems. That makes it a much smaller problem.
    >>the use of diaeresis is perfectly valid for words in fields that have a
    >>language ID 'German'.
    >>>The DIN request and the USNB solution didn't address this, because the
    >>>problem to be solved was disambiguating {a, o, u}-with-tréma from {a, o,
    >>>u}-with-umlaut. If there are combinations of (for example)
    >>>a-with-tréma-and-something-else AND ALSO
    >>>a-with-umlaut-and-something-else, then those two will need to be
    >>>disambiguated somehow. But I strongly doubt that the latter case exists
    >>>in German bibliographic data, though of course one never knows.
    >>First off, there have to be corresponding entries in the sorting tables
    >>used for such data, to make that distinction have the correct effect.
    >>Since the sorting tables would not support anything other than <BASE,
    >>CGJ, DIAERESIS> there's no reason to introduce other sequences into the data.
    >>Secondly, the diaeresis is used to indicate that two vowels are pronounced
    >>separately. I haven't seen a case where the vowels would already be accented.
    >There are such cases (although in most but not all of them technically the
    >vowel is not "already" accented because the diaeresis is encoded closer to
    >the base letter than the accent). This is certainly the case in Greek,
    >where diaeresis (indicating separate pronunciation) and accents commonly
    >occur on the same vowel; there are precomposed forms in the Greek and
    >Coptic and Greek Extended blocks. There are also a number of precomposed
    >forms in Latin Extended-B and Latin Extended Additional with both
    >diaeresis and another accent. Presumably these are used for some language
    >or other (well, some for Pinyin, some for Livonian, others unspecified).
    >And so they may occur in German bibliographic data. And in that database
    >each of them must have been encoded either with umlaut or with tréma
    >(presumably because they are understood as marking either a vowel quality
    >modification or a separation), and there is at least the possibility that
    >some combinations may have been encoded differently in different places in
    >the database. (And foreign words may be used within book titles marked as
    >German.) Therefore Unicode does need to consider the issue, both as a
    >theoretical one (which is essentially equivalent in terms of its effect on
    >normalisation to the theoretical problem with using variation selectors
    >with combining characters) and potentially as a practical one.
    >Peter Kirk
