Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jul 15 2004 - 05:42:09 CDT

    On 15/07/2004 10:32, Asmus Freytag wrote:

    > Nobody doubts that some text exists with multiple accents on vowels.
    > Where the vowels are not Latin a,o,u, there is no issue at all, in
    > this case, since there are no differences in German sorting for them. ...

    Well, yes, but N2819 (http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2819.pdf)
    does not make it clear that the <CGJ, DIAERESIS> sequence is to be used
    only with Latin a, o and u; rather it states "<CGJ, [DIAERESIS]> →
    tréma". Perhaps the proposal needs modification to make this point
    clear, if that is the intention.

    > ... Where the vowels are a, o, u, as for the Livonian example you
    > cited, it's a matter of the design of the collation table to get the
    > correct sorting behavior.
    >
    > If there is anything in UCA that would make it impossible to design
    > correct collation tables for German university libraries, when CGJ is
    > used with Trema, but not for umlaut, then you have an issue. At the
    > moment, I see lots of speculation, and red herrings (Greek and Coptic,
    > indeed!) but no smoking gun.

    The Greek and Coptic case is not irrelevant. First, you did not restrict
    the set of base characters when you wrote:

    > Secondly, the dieresis is used to indicate that two vowels are
    > pronounced separately. I haven't seen a case where the vowels would
    > already be accented.

    and of course the diaeresis and accent characters used in Greek are the
    same ones used in Latin script. Second, N2819 does not make it clear
    that the <CGJ, DIAERESIS> sequence is to be used only for Latin script
    data. I would expect (someone can check this of course, and without
    checking this is indeed speculation) that there is Greek text in German
    bibliographic databases in which the Greek diaeresis is represented in
    ISO 5426 as tréma rather than umlaut; that would be correct because the
    function of Greek diaeresis is separation rather than vowel
    modification. And I would expect an implementer reading N2819 to
    conclude that all ISO 5426 trémas should be converted to <CGJ,
    DIAERESIS> as no mention is made of a restriction to Latin script or to
    just a, o and u. So there is a real chance of a conversion program
    producing sequences which could confuse normalisation, e.g. <IOTA, CGJ,
    DIAERESIS, ACUTE>, although hopefully not <IOTA, ACUTE, CGJ, DIAERESIS>
    which might be a real problem.
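
    To make the normalisation concern concrete, here is a minimal sketch
    using Python's standard unicodedata module. The helper name nfc and the
    choice of NFC are only for illustration; the sequences are the ones
    mentioned above. It relies on CGJ having combining class 0, so that it
    blocks both reordering and composition of the marks which follow it.

```python
import unicodedata

IOTA = "\u03b9"       # GREEK SMALL LETTER IOTA
CGJ = "\u034f"        # COMBINING GRAPHEME JOINER (combining class 0)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT


def nfc(text):
    """Illustrative helper: return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


def show(label, text):
    """Print the code points of a string for inspection."""
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in text))


# Without CGJ the three characters compose into one precomposed letter.
plain = nfc(IOTA + DIAERESIS + ACUTE)

# With CGJ directly after the base, nothing composes and nothing moves.
cgj_first = nfc(IOTA + CGJ + DIAERESIS + ACUTE)

# With the acute first, base and acute compose, and the <CGJ, DIAERESIS>
# pair is left trailing behind a different precomposed letter.
acute_first = nfc(IOTA + ACUTE + CGJ + DIAERESIS)

show("plain:      ", plain)
show("cgj_first:  ", cgj_first)
show("acute_first:", acute_first)
```

    The three results are all different strings, so data converted with
    and without the CGJ would not compare equal after normalisation.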

    >
    > And yes, the incidence of Livonian data (relative to trema, which is
    > rather uncommon relative to umlaut) may be below a threshold where
    > providing a support short of the theoretical optimum is a practical
    > concern. That decision belongs to the German bibliographers.
    >
    Well, it seems that we are agreeing that there may be a problem in
    theory, and potentially in practice with small amounts of marginal data,
    but Unicode is choosing to leave the problem for the specific users of
    the sequence to deal with. That is indeed a reasonable approach. But it
    was not considered an acceptable one for use of variation selectors with
    combining marks, even in a case where there is no valid data which
    actually exhibits the normalisation problem.

    My concern as always is with the apparent inconsistency of bending the
    normal rules or ignoring the normalisation concerns for German while
    refusing to do more or less the same for Hebrew. I appreciate that
    Germany is a larger and richer country than Israel and so, at least for
    commercial interests, its concerns deserve some priority. But that
    should not be a reason to reject as invalid or insignificant issues
    concerning Hebrew. And the issue of avoiding incompatible representation
    of the same data is a real one for Hebrew Holam Male vs. Vav Haluma just
    as it is for German umlaut vs. tréma.

    I am not actually asking for variation selectors with combining marks
    because I realise that the UTC has already made a decision and is
    unlikely to reverse it. But I am asking for some flexibility on some of
    the principles, of the kind which has been demonstrated with umlaut and
    tréma, and also in the Indic scripts proposal under review, in order to
    find an acceptable solution to a real problem. That flexibility might
    include allowing either <VAV, variation selector, HOLAM> or <VAV, ZWJ,
    HOLAM> to represent Holam Male although technically the VAV glyph does
    not (usually) change (nor does the HOLAM glyph) and the HOLAM dot does
    not ligate with it, just moves relative to it.
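
    As a rough sketch of how these candidate sequences behave under
    normalisation, again with Python's unicodedata module: the use of
    U+FE00 (VS1) as "the" variation selector and the added DAGESH are my
    assumptions for illustration only, not part of any proposal.

```python
import unicodedata

VAV = "\u05d5"     # HEBREW LETTER VAV
HOLAM = "\u05b9"   # HEBREW POINT HOLAM (combining class 19)
DAGESH = "\u05bc"  # HEBREW POINT DAGESH OR MAPIQ (combining class 21)
ZWJ = "\u200d"     # ZERO WIDTH JOINER (combining class 0)
VS1 = "\ufe00"     # VARIATION SELECTOR-1 (assumed choice of selector)


def nfc(text):
    """Illustrative helper: return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


# The plain sequence and the two candidate sequences for Holam Male.
candidates = {
    "plain": VAV + HOLAM,
    "zwj": VAV + ZWJ + HOLAM,
    "vs1": VAV + VS1 + HOLAM,
}
normalized = {key: nfc(value) for key, value in candidates.items()}

# Each candidate stays a distinct string after normalisation.
all_distinct = len(set(normalized.values())) == len(normalized)

# Without a joiner, canonical ordering moves HOLAM in front of DAGESH.
reordered = nfc(VAV + DAGESH + HOLAM)

# A class-0 character between the marks blocks that reordering.
blocked = nfc(VAV + DAGESH + ZWJ + HOLAM)

print(all_distinct)
print([f"U+{ord(ch):04X}" for ch in reordered])
print([f"U+{ord(ch):04X}" for ch in blocked])
```

    So either sequence would remain distinguishable from plain <VAV,
    HOLAM>, but where the joiner sits relative to other marks affects
    whether normalisation reorders them.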

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

