Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 18 2002 - 21:39:23 EDT


    > >The ALA-LC conventions are not the only alternatives available for
    > >representation of Abkhaz and/or Khanty/Mansi data in romanization.
    > >In fact, you can find such data on the web using alternative
    > >romanizations. So it isn't as if the current gap in figuring out
    > >precisely how, in Unicode, to represent a double diacritic with
    > >another diacritic applied outside the visible double diacritic
    > >on a digraph is preventing anyone from using romanized Abkhaz or
    > >Khanty/Mansi data in interchange.
    >
    > By the same argument, Unicode might as well stop taking new characters;
    > surely, between the 500 Latin characters and dozens of punctuation marks
    > and combining characters and the other 70,000 characters, you can find
    > a way to communicate whatever language or data you need communicated.

    Of course. Let them use ASCII, for that matter.

    But that wasn't my point. As far as I can see, there is no
    particular evidence that the ALA-LC conventions with the dot
    above the graphic ligature ties are in widespread use for
    romanizations of these particular languages. So the *urgency* of
    solving this problem isn't there, unless the LC/library/bibliographic
    community comes to the UTC and indicates that they have a data interchange
    problem with USMARC records using ANSEL that requires a clear
    representation solution in Unicode. And before we go there, I'd
    like to have a clear specification of how it works in USMARC
    records, so we would know how to do the following conversion:

        USMARC <--> Unicode

    for the two forms in question.

    The 1990 version of the LC romanizations for this non-Slavic stuff
    used all kinds of hand-drawn forms. And even the 1997 version of
    the ALA-LC document is photo-offset from pages that include various
    kinds of pasteup from who-knows-what sources, including some
    hand-drawn, with at least one of these dots above being added by
    hand. So it isn't clear that there is any connection between the
    ALA-LC document text and the ANSEL character encoding actually used
    in the USMARC records; the printed forms could be arbitrary markup
    in some system like TeX, used only for publication.

    BTW, if we are blueskying about this topic, the *elegant* way
    to resolve this would be to recategorize all the double
    diacritics as *enclosing* combining marks (Me), rather than
    nonspacing marks (Mn), and then to rewrite the conventions for
    their use to match those of the enclosing circle and such. Then they
    would subtend (or supertend) any grapheme cluster, including
    arbitrary digraphs indicated with a COMBINING GRAPHEME JOINER
    character. And a dot above would neatly apply to the entire
    subtended cluster, as for circled characters, and so on.
    Of course, that would invalidate anybody's current
    usage of the characters. Oh well, you can't win 'em all.
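
    As a rough illustration of the Mn/Me distinction discussed above,
    the following sketch queries Python's standard unicodedata module.
    It assumes the module reflects a current Unicode Character Database
    (not the 2002 one), that the double diacritics of interest are those
    at U+035C..U+0362, and it uses a made-up digraph "ts" purely as
    sample data. It also shows why mark order alone cannot say "dot
    outside the double diacritic": canonical ordering sorts the dot
    (class 230) before the double mark (class 234).

```python
import unicodedata

# Assumption: the double diacritics of interest are U+035C..U+0362.
DOUBLE_DIACRITICS = [chr(cp) for cp in range(0x035C, 0x0363)]

def describe(ch):
    """Return (code point, name, general category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # e.g. 'Mn' or 'Me'
        unicodedata.combining(ch),  # canonical combining class
    )

# Double diacritics, then COMBINING DOT ABOVE, COMBINING GRAPHEME JOINER,
# and COMBINING ENCLOSING CIRCLE for comparison.
for ch in DOUBLE_DIACRITICS + ["\u0307", "\u034F", "\u20DD"]:
    print(describe(ch))

# Hypothetical sample: digraph "ts" with a double inverted breve and a dot.
dot_then_tie = "t\u0307\u0361s"
tie_then_dot = "t\u0361\u0307s"

# Canonical reordering puts the lower combining class first, so both
# spellings collapse to the same normalized string.
same = unicodedata.normalize("NFD", tie_then_dot) == dot_then_tie
print(same)
```

    If the assumptions hold, the two spellings normalize identically,
    so the intended "outside" placement of the dot is not recoverable
    from the character sequence by itself.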

    --Ken


