Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Peter Kirk
Date: Wed Jul 14 2004 - 14:08:45 CDT

    On 14/07/2004 18:40, Kenneth Whistler wrote:

    > ...
    >>OK. But this is not a unique case. For example, in Hebrew Silluq and
    >>Meteg, Dagesh and Shuruq are pairs of different marks which share a
    >>glyph and so a Unicode character but may need to be distinguished for
    >>certain processes.
    >Can you show a pre-existing ISO character encoding standard, such
    >as ISO 5429, for which there are bibliographic implementations
    >whose conversion to Unicode is blocked by an encoding distinction
    >not maintained in Unicode for these particular cases? ...

    No, but I can show a pre-existing clearly defined encoding, see dated 1982, especially point 1
    "We now distinguish holem waw (`OW') from waw followed by holem", i.e.
    Holam Male from Vav Haluma, and point 2 re the three variants of Meteg.
    Texts based on these encodings have been in the public domain and
    circulated widely since 1982, and are available from such repositories
    as CCAT and the Oxford Text Archive. Conversion of these texts to
    Unicode is blocked by the current failure of Unicode to distinguish
    Holam Male from Vav Haluma and to distinguish three positions of Meteg.

    >... If so, then
    >you would have an analogous situation. ...

    The only lack of analogy is that no one sought to get official ISO
    approval for an encoding which has been a de facto standard among
    Hebraists for more than 20 years.

    >... If not, then you are simply
    >talking about functional distinctions for the same encoded diacritic,
    >which might be needed to be maintained for some kinds of processing,
    >for which people can use whatever kinds of conventions they see
    >fit to deal with the issue -- but the issue doesn't rise to the
    >level of an encoding issue requiring formal intervention by WG2,
    >in my opinion.

    I accept that this may be true of the Meteg/Silluq and Dagesh/Shuruq
    distinctions, but not of the Holam Male/Vav Haluma and Meteg positioning
    distinctions, which do involve graphical distinctions.

    >>Should similar encodings with CGJ be proposed to make
    >>these distinctions?
    >If formal maintenance of a collation distinction between two
    >otherwise identically *appearing* pieces of text -- based on
    >whatever analytic status of the text is relevant -- is at issue,
    >then representation of one sequence with CGJ and one without
    >is a recommended way by the Unicode Standard to introduce a
    >distinction which a tailored collation can then weight differently
    >to get the required collation difference.

    OK. But the problem here is that sometimes there *is* a graphical
    distinction between umlaut and tréma, and one might expect
    bibliographers to use fonts which make that distinction when
    viewing their data. Unfortunately the chosen encoding with CGJ is not
    supposed to support such graphical distinctions, even though they would
    be very helpful for maintaining a database of mixed data. It
    seems to me that this solution will also "result in massive
    data representation ambiguities for German data" (quote from N2819). But
    then my main interest is not in German but in Hebrew.
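
    To make the collation point above concrete, here is a minimal sketch of
    how a tailored sort key could tell the two sequences apart. The function
    name sort_key is hypothetical, and the filing rule it applies (umlaut
    filed as base letter plus "e", tréma ignored) is only an assumption for
    illustration. A real implementation would tailor collation weights
    rather than rewrite strings.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def sort_key(text):
    """Hypothetical sort key: plain diaeresis = umlaut, CGJ + diaeresis = tréma."""
    s = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(s):
        if s.startswith(CGJ + DIAERESIS, i):
            # Tréma: assumed to be ignored for filing purposes.
            i += 2
        elif s[i] == DIAERESIS:
            # Umlaut: assumed to be filed as base letter followed by "e".
            out.append("e")
            i += 1
        elif s[i] == CGJ:
            # A stray CGJ carries no weight of its own in this sketch.
            i += 1
        else:
            out.append(s[i].lower())
            i += 1
    return "".join(out)

if __name__ == "__main__":
    words = ["Muller", "M\u00FCller", "Citroe" + CGJ + DIAERESIS + "n"]
    for w in sorted(words, key=sort_key):
        print(sort_key(w))
```

    The sketch only recognises CGJ immediately followed by the diaeresis, so
    it does not address the case where another combining mark shares the
    base character.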

    > ...
    >>256 variation selectors won't do if they have all been defined
    >>unchangeably with the wrong properties e.g combining class. On the other
    >>hand, if the UTC is prepared to ignore the combining class and
    >>normalisation problems involved in using one combining class zero
    >>character, CGJ, to modify a combining mark,
    >This completely misconstrues the solution in question for the
    >German umlaut and tréma in bibliographic records. The CGJ is
    >not introduced "to modify a combining mark". Instead, two
    >text elements required to be distinguished in German bibliographic
    >data are represented by two distinct sequences:
    >This is completely in keeping with the intent of the CGJ in the
    >standard, and the proposal did not, in any way, "ignore the
    >combining class and normalisation problems" in this case.
    >... Which, by the way, is why the solution met with unanimous
    >approval in WG2, without objection from the UTC liaison.

    N2819 does not deal with the issue of how to encode a base character (X)
    plus tréma and another combining mark (M). Should this be <X, M, CGJ,
    COMBINING DIAERESIS, M>? How is this issue affected by whether the
    combining class of M is less than, equal to or greater than that of
    COMBINING DIAERESIS? How do these sequences behave when normalised? The
    distinction is not necessarily theoretical because in some languages
    (certainly in Greek although I guess there is no ambiguity with umlaut
    there) a diaeresis indicating separation can co-occur with other
    accents. The German bibliographers need guidance on how to convert such
    combinations to Unicode while preserving the distinction from umlaut.
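
    The normalisation questions above can be explored with a short sketch
    using Python's standard unicodedata module. The choice of "u" for X and
    of COMBINING DOT BELOW (combining class 220) for M is purely an
    assumption for illustration; N2819 itself gives no such example.

```python
import unicodedata

# Code points used for illustration. DOT_BELOW is only a stand-in for the
# generic mark M, chosen because its combining class is below 230.
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

def nfd(s):
    return unicodedata.normalize("NFD", s)

def nfc(s):
    return unicodedata.normalize("NFC", s)

def names(s):
    """Return the character names of a string, for readable printing."""
    return [unicodedata.name(c) for c in s]

# Canonical combining classes that drive canonical reordering.
ccc = {unicodedata.name(c): unicodedata.combining(c)
       for c in (CGJ, DIAERESIS, DOT_BELOW)}

umlaut = "u" + DIAERESIS        # plain sequence
trema = "u" + CGJ + DIAERESIS   # sequence with CGJ

# Two candidate orders for X + tréma + M.
m_after = "u" + CGJ + DIAERESIS + DOT_BELOW
m_before = "u" + DOT_BELOW + CGJ + DIAERESIS

if __name__ == "__main__":
    print(ccc)
    for label, s in [("umlaut", umlaut), ("trema", trema),
                     ("M after", m_after), ("M before", m_before)]:
        print(label, "NFD:", names(nfd(s)))
        print(label, "NFC:", names(nfc(s)))
```

    Running this shows that when M follows the diaeresis, canonical
    reordering moves M between the CGJ and the diaeresis, whereas placing M
    before the CGJ leaves the CGJ and diaeresis adjacent. The two orders are
    therefore not canonically equivalent, which is exactly why guidance is
    needed.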

    >>it may as well ignore the
    >>identical problems involved in using variation selectors, also combining
    >>class zero, with combining marks.
    >What you have been suggesting to do, however, *does* advocate
    >ignoring the problems involved in attempting to use variation
    >selectors to formally distinguish variants of combining marks.

    No, I have attempted to deal with these issues, in the old thread on
    "Variation selectors and vowel marks", and have described in some detail
    what might be done in situations where the modified combining mark and
    another mark are on the same base character. I accept that I did not
    find a fully satisfactory solution, but I certainly did not ignore the
    problem. But the umlaut/tréma proposal fails to discuss this problem at
    all and so can reasonably be accused of ignoring it.

    Peter Kirk
