Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 14 2004 - 12:40:13 CDT


    Peter Kirk wrote:

    > > At 11:02 AM 7/13/2004, Peter Kirk wrote:
    > >
    > >> I was surprised to see that WG2 has accepted a proposal made by the
    > >> US National Body to use CGJ to distinguish between Umlaut and Tréma
    > >> in German bibliographic data.

    And Asmus responded:

    > > You raise some interesting questions. However, note that the purpose
    > > of CGJ is intended for sorting related distinctions, which are at
    > > issue here. This is different from variation selectors which are
    > > intended to be used for displayed variations.

    Note that the problem for German bibliographic records of
    distinguishing umlaut from tréma was a longstanding issue for
    the German national body, and was blocking them from cutover of
    German bibliographic systems from ISO 5426 implementations to
    Unicode-based implementations.

    The proposal that the U.S. national body made met the technical
    requirements that the German national body had, breaking this
    logjam. And unlike the original German proposal, it did not
    have massive consequences for the representation of umlaut in
    other data and for interoperating with German bibliographic systems.

    So the fact that the proposal was acceptable and accepted by WG2
    should not be too surprising. It solved a data representation
    problem in a manner acceptable to all parties involved.
     
    > OK. But this is not a unique case. For example, in Hebrew Silluq and
    > Meteg, Dagesh and Shuruq are pairs of different marks which share a
    > glyph and so a Unicode character but may need to be distinguished for
    > certain processes.

    Can you show a pre-existing ISO character encoding standard, such
    as ISO 5426, for which there are bibliographic implementations
    whose conversion to Unicode is blocked by an encoding distinction
    not maintained in Unicode for these particular cases? If so, then
    you would have an analogous situation. If not, then you are simply
    talking about functional distinctions for the same encoded diacritic,
    which might need to be maintained for some kinds of processing,
    for which people can use whatever kinds of conventions they see
    fit to deal with the issue -- but the issue doesn't rise to the
    level of an encoding issue requiring formal intervention by WG2,
    in my opinion.

    This is a little like noting that U+0301 COMBINING ACUTE ACCENT,
    when applied to Latin letters, might under some circumstances
    represent a stress, under others a pitch accent, under others a
    formal tonemic distinction, under others a vocalic length
    distinction, and under others a change in vowel quality. Such
    distinctions might be relevant to many different kinds of
    textual processing concerned with linguistic effects, but it
    is not a character encoding issue.

    > Should similar encodings with CGJ be proposed to make
    > these distinctions?

    If formal maintenance of a collation distinction between two
    otherwise identically *appearing* pieces of text -- based on
    whatever analytic status of the text is relevant -- is at issue,
    then representation of one sequence with CGJ and one without
    is a way recommended by the Unicode Standard to introduce a
    distinction which a tailored collation can then weight differently
    to get the required collation difference.
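
    As a rough illustration of that mechanism, here is a toy sort key
    in Python. It is only a sketch, not a real collation tailoring:
    the name toy_sort_key is hypothetical, and the particular weighting
    (a bare diaeresis expands to base letter + "e", a diaeresis preceded
    by CGJ gets no primary weight) is an assumption made for the example.
    A real system would express the difference in its collation tables.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def toy_sort_key(text):
    """Hypothetical key: bare diaeresis sorts as base + 'e',
    CGJ + diaeresis is ignored. Assumed weighting, for illustration."""
    decomposed = unicodedata.normalize("NFD", text)
    key = []
    i = 0
    while i < len(decomposed):
        if decomposed.startswith(CGJ + DIAERESIS, i):
            i += 2           # CGJ-marked diaeresis: no primary weight
        elif decomposed[i] == DIAERESIS:
            key.append("e")  # bare diaeresis: expand to base + "e"
            i += 1
        elif decomposed[i] == CGJ or unicodedata.combining(decomposed[i]):
            i += 1           # drop any other CGJ or combining mark
        else:
            key.append(decomposed[i].lower())
            i += 1
    return "".join(key)


# The first two strings look identical when displayed.
words = ["Mu" + DIAERESIS + "ller", "Mu" + CGJ + DIAERESIS + "ller", "Mufti"]
ordered = sorted(words, key=toy_sort_key)
```

    With this key the two identically appearing strings get the keys
    "mueller" and "muller", so "Mufti" sorts between them.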

    > So I must agree with Doug that
    > "CGJ + COMBINING DIAERESIS is a hack".

    It is simply a way to maintain a distinction needed for German
    bibliographic data to behave as required, while representing
    their data in Unicode. Call it a hack if you like, but it
    satisfied the relevant parties as an appropriate means for
    representing the data in question.

    > 256 variation selectors won't do if they have all been defined
    > unchangeably with the wrong properties e.g combining class. On the other
    > hand, if the UTC is prepared to ignore the combining class and
    > normalisation problems involved in using one combining class zero
    > character, CGJ, to modify a combining mark,

    This completely misconstrues the solution in question for the
    German umlaut and tréma in bibliographic records. The CGJ is
    not introduced "to modify a combining mark". Instead, two
    text elements required to be distinguished in German bibliographic
    data are represented by two distinct sequences:

    X + COMBINING DIAERESIS
    X + CGJ + COMBINING DIAERESIS

    This is completely in keeping with the intent of the CGJ in the
    standard, and the proposal did not, in any way, "ignore the
    combining class and normalisation problems" in this case.
    ... Which, by the way, is why the solution met with unanimous
    approval in WG2, without objection from the UTC liaison.
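
    For anyone who wants to check how the two sequences behave under
    normalization, a minimal Python sketch follows. It assumes "u" as
    the base letter X and uses only the standard unicodedata module;
    the variable names are just for the example.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

plain = "u" + DIAERESIS           # X + COMBINING DIAERESIS
with_cgj = "u" + CGJ + DIAERESIS  # X + CGJ + COMBINING DIAERESIS

nfc_plain = unicodedata.normalize("NFC", plain)
nfc_with_cgj = unicodedata.normalize("NFC", with_cgj)
nfd_with_cgj = unicodedata.normalize("NFD", with_cgj)

# The plain sequence composes to the single code point U+00FC.
print([f"U+{ord(c):04X}" for c in nfc_plain])     # ['U+00FC']
# The CGJ sequence is left as three code points by NFC.
print([f"U+{ord(c):04X}" for c in nfc_with_cgj])  # ['U+0075', 'U+034F', 'U+0308']
# Canonical combining classes of CGJ and the diaeresis.
print(unicodedata.combining(CGJ), unicodedata.combining(DIAERESIS))  # 0 230
```

    The two sequences stay distinct in both NFC and NFD, which is what
    lets a process downstream tell them apart.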

    > it may as well ignore the
    > identical problems involved in using variation selectors, also combining
    > class zero, with combining marks.

    What you have been suggesting to do, however, *does* advocate
    ignoring the problems involved in attempting to use variation
    selectors to formally distinguish variants of combining marks.
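
    One problem of this kind can be shown with a short Python sketch.
    It assumes, purely for illustration, that someone places VS1
    (U+FE00) after COMBINING DIAERESIS to mark a variant of the mark;
    no such variation sequence is defined in the standard.

```python
import unicodedata

VS1 = "\ufe00"        # VARIATION SELECTOR-1
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# Hypothetical sequence: the selector is meant to apply to the diaeresis.
marked = "u" + DIAERESIS + VS1
composed = unicodedata.normalize("NFC", marked)

# After NFC the diaeresis has been absorbed into U+00FC, so the
# selector now follows the precomposed letter, not the mark.
print([f"U+{ord(c):04X}" for c in composed])  # ['U+00FC', 'U+FE00']
```

    After normalization there is no longer a separate combining mark
    for the selector to follow, so the intended distinction does not
    survive in a usable form.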

    --Ken


