Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 18 2002 - 17:08:55 EDT


    William Overington asked:

    > In the discussion about romanization of Cyrillic ligatures I asked how one
    > expresses in Unicode the ts ligature with a dot above.
    >
    > Regarding Ken's response to the Byzantine legal codes matter, it would
    > appear possible that the way that the ts ligature with a dot above for
    > romanization of Cyrillic could be represented in Unicode would be by the
    > following sequence.
    >
    > t U+FE20 s U+FE21 U+0307
    >
    > The ordinary ts ligature for romanization of Cyrillic being expressed as
    > follows.
    >
    > t U+FE20 s U+FE21
    >

    As Peter indicated, the preferred way to represent this graphic ligature
    tie in Unicode is with a double diacritic (here U+0361 COMBINING DOUBLE
    INVERTED BREVE) placed between the two letters, i.e.:

    t U+0361 s

    U+FE20 and U+FE21 are compatibility characters, for interoperation,
    in particular, with the USMARC catalog records using the Extended
    Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL). See:

    http://lcweb.loc.gov/catdir/cpso/romanization/charsets/pdf
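
    To make the two spellings concrete, here is a minimal Python sketch
    that lists the characters in each sequence using the standard
    unicodedata module. The variable names and the describe() helper are
    illustrative only. The last line checks whether compatibility
    normalization (NFKC) folds the half-mark spelling into the
    double-diacritic one; as far as I can tell it does not, since the
    half marks carry no decomposition mapping.

```python
import unicodedata

# Preferred form: one double diacritic between the two letters.
preferred = "t\u0361s"
# Compatibility form: a half mark after each letter (for ANSEL interoperation).
compat = "t\ufe20s\ufe21"


def describe(text):
    """Return (code point, character name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in (("preferred", preferred), ("compat", compat)):
    print(label)
    for code_point, name in describe(text):
        print(f"  {code_point} {name}")

# Does NFKC turn the half-mark spelling into the double-diacritic spelling?
print(unicodedata.normalize("NFKC", compat) == preferred)
```

    The practical consequence is that software comparing the two spellings
    would have to map one to the other itself.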

    > It appears to me that the ts ligature with a dot above, and a similar ng
    > ligature with a dot above, are already needed for the Library of Congress
    > romanization of Cyrillic system.
    >
    > The following directory contains a lot of pdf files.
    >
    > http://lcweb.loc.gov/catdir/cpso/romanization
    >
    > The ts ligature with a dot above can be found on page 2 of the nonslav.pdf
    > file. The ng ligature with a dot above can be found on page 13 of the same
    > file.

    And, in particular, the ts ligature with a dot above is for an Abkhaz
    romanization, and the ng ligature with a dot above is for an obsolete
    Mansi (related to Khanty) romanization. I suspect their actual use
    is pretty limited.

    >
    > Capital letter versions of the two ligatures are needed as well.

    Well, this is interesting, since these were *added*, systematically,
    to the 1997 version of the ALA-LC non-Slavic romanization systems. The
    1990 version did not have them.

    That raises the question of whether these were simply editorial
    extensions, or were actually *needed* for some bibliographical
    data. I consider it unlikely that all of the capital forms were
    suddenly discovered between 1990 and 1997 and that a whole bunch
    of USMARC bibliographical records making use of the capital forms
    were created during that interval.

    In this regard, one should *read* the ALA-LC document. See charsets.pdf:

    "The transliterations produced by applying ALA-LC Romanization Tables
    are encoded in machine-readable form into USMARC records. Encoding of
    the basic Latin alphabet, special characters, and character modifiers
    listed in this publication is done in USMARC records following two
    American National Standards; the Code for Information Interchange
    (ASCII) (ANSI X3.4), and the Extended Latin Alphabet Coded Character
    Set for Bibliographic Use (ANSEL) (ANSI Z39.47). Each character
    is assigned a unique hexadecimal (base-16) code which identifies it
    unambiguously for computer processing."

    The current specification of how that is done is the "MARC 21 Specifications
    for Record Structure, Character Sets, and Exchange Media." Among other
    things, it spells out how combining marks are used with base
    characters in USMARC records.

    I don't know, however, whether any provision was actually made in MARC 21
    for these instances of ligature ties with dots above. Perhaps
    someone familiar with the details of USMARC can answer that.

    The USMARC records (using ANSEL) *would*, however, be making use
    of the half ligature characters:

    0xEB LIGATURE, FIRST HALF
    0xEC LIGATURE, SECOND HALF

    as well as:

    0xE7 SUPERIOD [sic] DOT (s.b. "SUPERIOR DOT")

    It just isn't clear in exactly what order these would occur in any
    hypothetical USMARC record actually using either the Abkhaz or
    Mansi romanization in question.
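
    As a rough illustration of what a converter has to do with these
    bytes, here is a Python sketch. It assumes the usual mapping of the
    three ANSEL marks to U+FE20, U+FE21 and U+0307, and it assumes the
    MARC convention that a combining mark is stored before its base
    character, whereas Unicode stores it after. The function name
    ansel_to_unicode and the byte order used for the dotted ligature are
    hypothetical; the latter is exactly the unknown discussed above.

```python
# Assumed mapping for the three ANSEL combining marks named above.
ANSEL_MARKS = {
    0xEB: "\ufe20",  # LIGATURE, FIRST HALF  -> COMBINING LIGATURE LEFT HALF
    0xEC: "\ufe21",  # LIGATURE, SECOND HALF -> COMBINING LIGATURE RIGHT HALF
    0xE7: "\u0307",  # SUPERIOR DOT          -> COMBINING DOT ABOVE
}


def ansel_to_unicode(data: bytes) -> str:
    """Convert ASCII letters plus the three marks above to a Unicode string.

    Marks that precede a base character in the input are emitted after
    that base character, in their original relative order.
    """
    out = []
    pending = []  # marks seen since the last base character
    for byte in data:
        if byte in ANSEL_MARKS:
            pending.append(ANSEL_MARKS[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"unhandled ANSEL byte 0x{byte:02X}")
    if pending:
        raise ValueError("combining mark with no base character")
    return "".join(out)


# Plain ts ligature: first half before 't', second half before 's'.
plain = ansel_to_unicode(bytes([0xEB, 0x74, 0xEC, 0x73]))

# One HYPOTHETICAL order for the dotted ligature; other orders are possible.
dotted = ansel_to_unicode(bytes([0xE7, 0xEB, 0x74, 0xEC, 0x73]))

print([f"U+{ord(ch):04X}" for ch in plain])
print([f"U+{ord(ch):04X}" for ch in dotted])
```

    Whatever order a real record uses, a converter of this kind would
    simply carry it over into the Unicode text, so the question of the
    intended order has to be answered on the USMARC side first.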

    > I wonder if consideration could please be given as to whether this matter
    > should be left unregulated or whether some level of regulation should be
    > used.

    I think this should depend first on a determination of whether there
    is a demonstrated need for an actual representation of these sequences --
    which ought to be determined by the people responsible for the
    data stores which might contain them, namely the online bibliographic
    community.

    The ALA-LC conventions are not the only alternatives available for
    representation of Abkhaz and/or Khanty/Mansi data in romanization.
    In fact, you can find such data on the web using alternative
    romanizations. So the current gap, namely that it is not settled
    precisely how Unicode should represent a double diacritic on a
    digraph with another diacritic applied outside the visible double
    diacritic, is not preventing anyone from using romanized Abkhaz or
    Khanty/Mansi data in interchange.
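
    One way to see why the representation is not obvious is to look at
    what canonical ordering does to the candidate sequences. The Python
    sketch below is only an illustration; the labels and the
    code_points() helper are made up for the example. It relies on the
    canonical combining classes of U+0307 (230) and U+0361 (234), under
    which a dot typed after the tie on the same base letter is reordered
    in front of it.

```python
import unicodedata

DOT_ABOVE = "\u0307"     # canonical combining class 230
DOUBLE_BREVE = "\u0361"  # canonical combining class 234

# Three candidate spellings of "ts with tie and dot above".
candidates = {
    "dot after tie": "t" + DOUBLE_BREVE + DOT_ABOVE + "s",
    "dot before tie": "t" + DOT_ABOVE + DOUBLE_BREVE + "s",
    "dot after s": "t" + DOUBLE_BREVE + "s" + DOT_ABOVE,
}


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, text in candidates.items():
    nfd = unicodedata.normalize("NFD", text)
    print(f"{label}: {code_points(text)} -> NFD {code_points(nfd)}")
```

    The first two candidates come out identical after normalization, so
    they cannot carry different meanings, and the third attaches the dot
    to the s alone. None of this settles which sequence, if any, should
    mean a dot centered over the tie.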

    --Ken

    >
    > William Overington
    >
    > 18 September 2002


