Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: James E. Agenbroad (
Date: Mon Sep 23 2002 - 12:14:40 EDT


         My comment is inserted after the first paragraph.

    On Fri, 20 Sep 2002, Kenneth Whistler wrote:
    > Look, let's consider again what problem we are trying to solve
    > here. We have two funky forms from the ALA-LC transliteration
    > tables, for which we haven't heard back yet from bibliographic
    > sources whether there actually is any *actual* data representation
    > problem in USMARC records.

    Jim Agenbroad: As of March 14, 1992, (latest data readily available) in
    the MARC Books file of 3,279,507 records there were six occurrences of the
    combination lower case 't' with dot over and ligature, first half. There
    were none with:
     1. 'T' with dot over followed by ligature, first half,
     2. nor other letters (upper or lower case) with dot over followed by
         either half of the ligature,
     3. nor with any letter (upper or lower case) and either half of the
         ligature before the dot over.
    Others may say how adequate the following solution would be or how
    important having one is. Here I just document that the dot over the
    ligature does occur with 't' which is consistent with its expected use for
    romanizing Abkhaz, presumably letter "Cyrillic small ligature te
    tse" (U+04B5) though the position of its descender differs between ALA/LC
    (page 139) and Unicode 3.0 (pages 377, 381).
              Jim Agenbroad (disclaimer and addresses at bottom)
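A tally like the one above could be sketched as follows for records that have already been converted to Unicode. This is only an illustration, not the procedure used on the MARC Books file. It assumes that "dot over" has been mapped to U+0307 COMBINING DOT ABOVE and the ligature halves to U+FE20 and U+FE21, and that the combining marks follow their base letter, as Unicode requires. The function name `count_marks` and the sample string are hypothetical.

```python
import re

DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE (assumed mapping of ANSEL "dot over")
LIG_LEFT = "\uFE20"    # COMBINING LIGATURE LEFT HALF (assumed "ligature, first half")
LIG_RIGHT = "\uFE21"   # COMBINING LIGATURE RIGHT HALF (assumed "ligature, second half")
EITHER_HALF = "[" + LIG_LEFT + LIG_RIGHT + "]"


def count_marks(text, base):
    """Count occurrences of `base` carrying a dot above plus a ligature half.

    The two orders of the marks are tallied separately, mirroring the
    distinction drawn in the list above.
    """
    letter = re.escape(base)
    dot_first = re.compile(letter + DOT_ABOVE + EITHER_HALF)
    ligature_first = re.compile(letter + EITHER_HALF + DOT_ABOVE)
    return {
        "dot_then_ligature": len(dot_first.findall(text)),
        "ligature_then_dot": len(ligature_first.findall(text)),
    }


# Invented sample text, not taken from any MARC record.
sample = (
    "t" + DOT_ABOVE + LIG_LEFT + "s" + LIG_RIGHT + " / "
    + "t" + LIG_LEFT + DOT_ABOVE + "s" + LIG_RIGHT + " / "
    + "t" + LIG_LEFT + "s" + LIG_RIGHT
)

if __name__ == "__main__":
    for base in ("t", "T"):
        print(base, count_marks(sample, base))
```

Keeping the two orders apart matters because U+0307 and U+FE20 are both non-spacing marks placed above the base, so the order in which they are stored is itself data.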
    > Now consider again the function of these things in the ALA-LC
    > transliteration. The Cyrillic transliteration recommendations
    > make rather extensive use of ligature ties. Why? Because the
    > ALA-LC transliteration schemes make some effort to be round-trippable.
    > In other words, the Cyrillic transliteration they recommend is
    > not merely a useful romanization that might be in more general
    > use, as for a newspaper, but is a romanization from which, in
    > principle, you ought to be able to recover the Cyrillic it
    > was transliterated from. Thus these schemes distinguish t-s
    > from t-s-tie-ligature, since the ligated form might be a
    > transliteration of a tse or similar letter, whereas the t-s
    > would be a transliteration of a te+es, and so on. In other
    > words, the tie-ligatures are being sprinkled in to make ad hoc
    > digraphs for the transliteration, to aid in recovery of the
    > Cyrillic from the romanization.
    > Now the dots above typically represent an articulatory diacritic,
    > as for palatalization, or the like.
    > So the combination of the two is to indicate: we are transliterating
    > a letter with a palatal (say) diacritic, using a digraph.
    > Do we have alternatives in Unicode for that? Well, yes, depending
    > on whether the problem is:
    > a. enabling exact transcoding of the USMARC data records
    > using ALA-LC romanization recommendations and the ANSEL
    > character set, for interoperability with Unicode systems.
    > or
    > b. typesetting the ALA-LC romanization document guide in
    > Unicode, treating all the data therein as plain text and
    > using generic Unicode rendering rules.
    > I contend that the primary problem is a), and that we ought
    > to examine the general usefulness of this dot-above-double-diacritic
    > and related rendering, before we insist it has to be representable
    > in plain text and go looking for an encoding solution and specify a
    > bunch of rendering rules for it.
    > If the essential requirement here is to capture the data
    > functionality of the transliteration: a roundtrippable form,
    > with a palatal diacritic, using a digraph, we could suggest,
    > for instance:
    > <U+0074, U+034F, U+0073, U+0307>
    > or
    > <U+0074, U+0307, U+034F, U+0073>
    > where we end up with an explicitly indicated digraph, with a
    > dot-above diacritic (pick which letter you want it on), as
    > a grapheme cluster. This is distinct from:
    > <U+0074, U+0073, U+0307>
    > or
    > <U+0074, U+0307, U+0073>
    > so you have your transliteration round-trippability intact.
    > And for your special-purpose application, which is a Unicode system
    > to display USMARC bibliographic records using the ALA-LC romanization
    > presentation conventions, you add ligation entries to your font
    > so that
    > <U+0074, U+034F, U+0073, U+0307>
    > and similar forms using a U+034F GRAPHEME JOINER display with a
    > visible tie-ligature, rather than nothing, despite the fact that
    > no U+0361 double diacritic is being used in the data. Problem
    > solved.
    > Of course, that doesn't mean that your converted USMARC data
    > records involving digraphs for Cyrillic transliteration will
    > display with the tie-ligature in a generic web application using
    > off-the-shelf fonts -- but is that the problem we are trying
    > to solve here? I doubt it. The forms would be legible -- perhaps
    > more legible without the obtrusive ties cluttering them up --
    > and the data distinctions would still be preserved in such
    > contexts.
    > --Ken
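The four sequences Ken lists can be compared directly as code-point strings. The sketch below is a minimal illustration of his point that the digraph forms marked with U+034F stay distinct from the plain forms, so the data distinction survives even without a visible tie. The variable and function names are hypothetical, and the check under normalization is an added assumption about what "distinct" should mean in practice. Python's `unicodedata` module reports U+034F under its formal name, COMBINING GRAPHEME JOINER.

```python
import unicodedata

CGJ = "\u034F"   # COMBINING GRAPHEME JOINER
DOT = "\u0307"   # COMBINING DOT ABOVE

# Digraph explicitly marked with the joiner, dot on either letter.
digraph_dot_on_s = "t" + CGJ + "s" + DOT
digraph_dot_on_t = "t" + DOT + CGJ + "s"

# Plain letter sequences without the joiner.
plain_dot_on_s = "t" + "s" + DOT
plain_dot_on_t = "t" + DOT + "s"

SEQUENCES = [digraph_dot_on_s, digraph_dot_on_t, plain_dot_on_s, plain_dot_on_t]


def code_points(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join("U+{:04X}".format(ord(ch)) for ch in text)


def is_marked_digraph(text):
    """True if the sequence carries the explicit digraph marker."""
    return CGJ in text


def distinct_under(form):
    """True if all four sequences remain different after normalization."""
    normalized = {unicodedata.normalize(form, s) for s in SEQUENCES}
    return len(normalized) == len(SEQUENCES)


if __name__ == "__main__":
    for s in SEQUENCES:
        label = "digraph" if is_marked_digraph(s) else "plain"
        print("{:<8}{}".format(label, code_points(s)))
    for form in ("NFC", "NFD"):
        print(form, "keeps all four distinct:", distinct_under(form))
```

A converter going back to Cyrillic could use a test like `is_marked_digraph` to decide whether "ts" stands for one letter or for two, which is the round-trip property the ties were meant to provide.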

              Jim Agenbroad ( )
         "It is not true that people stop pursuing their dreams because they
    grow old, they grow old because they stop pursuing their dreams." Adapted
    from a letter by Gabriel Garcia Marquez.
         The above are purely personal opinions, not necessarily the official
    views of any government or any agency of any.
         Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
    mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE,
    Washington, D.C. 20540-9334 U.S.A.
    Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.
