Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: James E. Agenbroad
Date: Mon Sep 23 2002 - 11:22:51 EDT

    On Fri, 20 Sep 2002, Kenneth Whistler wrote:

    > Peter said:
    > > >This stuff *can* all be handled with appropriately designed
    > > >ligations in fonts, so there are options for display:
    > > >
    > > ><U+0074, U+0361, U+0073, U+0307>
    > > >
    > > > ==>
    > > > maps via ligation table to:
    > > >
    > > >{t-s-tie-ligature-with-dot-above} glyph
    > >
    > > I would consider this an anomalous rendering. It is counter-exemplified by
    > > figure 7-6 in TUS3.0. I'd be concerned about longer-term problems if we
    > > decided to say that this was a valid alternate rendering from
    > >
    > > >{t-s-dot-tie-ligature} glyph
    > Well, yes, it would be anomalous, which is why it would require
    > somebody to go to the trouble to make a special ligation table
    > entry for it.
    > But what longer-term problems are you talking about? I didn't
    > say we should put in a formal rendering *rule* in the Unicode
    > Standard that says something different from Figure 7-6, along
    > the lines of converting one form to the other as above.
    > Look, let's consider again what problem we are trying to solve
    > here. We have two funky forms from the ALA-LC transliteration
    > tables, for which we haven't heard back yet from bibliographic
    > sources whether there is any *actual* data representation
    > problem in USMARC records.
    > We can try to invent and promulgate a generic rendering solution
    > for these cases (and anything like them) in the Unicode Standard,
    > despite the fact that they are an edge case of an edge case for
    > Latin script rendering... Or, if it turns out that it isn't a
    > general-enough problem to force everyone to deal with it in terms
    > of generic rendering, we could suggest alternatives:
    >
    >   a. markup solutions
    >   b. specific font ligation solutions for specialized data
    >
    > Now consider again the function of these things in the ALA-LC
    > transliteration. The Cyrillic transliteration recommendations
    > make rather extensive use of ligature ties. Why? Because the
    > ALA-LC transliteration schemes make some effort to be round-trippable.
    > In other words, the Cyrillic transliteration they recommend is
    > not merely a useful romanization that might be in more general
    > use, as for a newspaper, but is a romanization from which, in
    > principle, you ought to be able to recover the Cyrillic it
    > was transliterated from. Thus these schemes distinguish t-s
    > from t-s-tie-ligature, since the ligated form might be a
    > transliteration of a tse or similar letter, whereas the t-s
    > would be a transliteration of a te+es, and so on. In other
    > words, the tie-ligatures are being sprinkled in to make ad hoc
    > digraphs for the transliteration, to aid in recovery of the
    > Cyrillic from the romanization.
    > Now the dots above typically represent an articulatory diacritic,
    > as for palatalization, or the like.
    > So the combination of the two is to indicate: we are transliterating
    > a letter with a palatal (say) diacritic, using a digraph.
    > Do we have alternatives in Unicode for that? Well, yes, depending
    > on whether the problem is:
    >
    >   a. enabling exact transcoding of the USMARC data records
    >      using ALA-LC romanization recommendations and the ANSEL
    >      character set, for interoperability with Unicode systems.
    >
    > or
    >
    >   b. typesetting the ALA-LC romanization document guide in
    >      Unicode, treating all the data therein as plain text and
    >      using generic Unicode rendering rules.
    >
    > I contend that the primary problem is a), and that we ought
    > to examine the general usefulness of this dot-above-double-diacritic
    > and related rendering, before we insist it has to be representable
    > in plain text and go looking for an encoding solution and specify a
    > bunch of rendering rules for it.
    > If the essential requirement here is to capture the data
    > functionality of the transliteration: a roundtrippable form,
    > with a palatal diacritic, using a digraph, we could suggest,
    > for instance:
    > <U+0074, U+034F, U+0073, U+0307>
    > or
    > <U+0074, U+0307, U+034F, U+0073>
    > where we end up with an explicitly indicated digraph, with a
    > dot-above diacritic (pick which letter you want it on), as
    > a grapheme cluster. This is distinct from:
    > <U+0074, U+0073, U+0307>
    > or
    > <U+0074, U+0307, U+0073>
    > so you have your transliteration round-trippability intact.
    > And for your special-purpose application, which is a Unicode system
    > to display USMARC bibliographic records using the ALA-LC romanization
    > presentation conventions, you add ligation entries to your font
    > so that
    > <U+0074, U+034F, U+0073, U+0307>
    > and similar forms using a U+034F GRAPHEME JOINER display with a
    > visible tie-ligature, rather than nothing, despite the fact that
    > no U+0361 double diacritic is being used in the data. Problem
    > solved.
    > Of course, that doesn't mean that your converted USMARC data
    > records involving digraphs for Cyrillic transliteration will
    > display with the tie-ligature in a generic web application using
    > off-the-shelf fonts -- but is that the problem we are trying
    > to solve here? I doubt it. The forms would be legible -- perhaps
    > more legible without the obtrusive ties cluttering them up --
    > and the data distinctions would still be preserved in such
    > contexts.
    > --Ken
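
The data-level claim in Ken's message (that the sequences with U+034F stay distinct from the plain ones, so the romanization remains round-trippable) can be checked mechanically. What follows is only a minimal sketch in Python using the standard `unicodedata` module. The six-letter mapping table and the helper names `romanize`, `deromanize`, and `code_points` are hypothetical, invented for illustration; this is not the ALA-LC table, which uses the visible ligature tie rather than U+034F.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, marks the digraph
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# The four sequences quoted in the message above.
sequences = [
    "t" + CGJ + "s" + DOT_ABOVE,   # digraph, dot on the s
    "t" + DOT_ABOVE + CGJ + "s",   # digraph, dot on the t
    "t" + "s" + DOT_ABOVE,         # plain t-s, dot on the s
    "t" + DOT_ABOVE + "s",         # plain t-s, dot on the t
]

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Check that no normalization form collapses any two of the sequences.
all_distinct = all(
    len({unicodedata.normalize(form, s) for s in sequences}) == len(sequences)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

# Toy transliteration table (an assumption for this sketch only):
# tse becomes an explicit digraph, te and es stay plain letters.
TO_LATIN = {"о": "o", "т": "t", "с": "s", "е": "e", "к": "k",
            "ц": "t" + CGJ + "s"}
TO_CYRILLIC = {latin: cyr for cyr, latin in TO_LATIN.items()}

def romanize(text):
    """Map each Cyrillic letter through the toy table."""
    return "".join(TO_LATIN[ch] for ch in text)

def deromanize(text):
    """Recover Cyrillic by longest-match lookup, so the digraph wins over t."""
    keys = sorted(TO_CYRILLIC, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(TO_CYRILLIC[key])
                i += len(key)
                break
        else:
            raise ValueError(f"no mapping at position {i}: {text[i]!r}")
    return "".join(out)

if __name__ == "__main__":
    for s in sequences:
        print(code_points(s), "->", code_points(unicodedata.normalize("NFC", s)))
    print("distinct under all normalization forms:", all_distinct)
    for word in ("отсек", "отец"):
        latin = romanize(word)
        print(word, "->", code_points(latin), "->", deromanize(latin))
```

In this sketch the word with te+es and the word with tse romanize to different code point sequences, and both come back unchanged. Dropping the joiner from the second one would turn it into te+es on the way back, which is exactly the ambiguity the digraph marking is meant to prevent. How the joiner is displayed is a separate, font-level matter that the code does not touch.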

              Jim Agenbroad
         "It is not true that people stop pursuing their dreams because they
    grow old, they grow old because they stop pursuing their dreams." Adapted
    from a letter by Gabriel Garcia Marquez.
         The above are purely personal opinions, not necessarily the official
    views of any government or any agency of any.
         Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
    mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE,
    Washington, D.C. 20540-9334 U.S.A.
    Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.
