Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 20 2002 - 20:07:30 EDT

    Peter said:

    > >This stuff *can* all be handled with appropriately designed
    > >ligations in fonts, so there are options for display:
    > >
    > ><U+0074, U+0361, U+0073, U+0307>
    > >
    > > ==>
    > > maps via ligation table to:
    > >
    > >{t-s-tie-ligature-with-dot-above} glyph
    >
    > I would consider this an anomalous rendering. It is counter-exemplified by
    > figure 7-6 in TUS3.0. I'd be concerned of longer-term problems if we
    > decided to say that this was a valid alternate rendering from
    >
    > >{t-s-dot-tie-ligature} glyph

    Well, yes, it would be anomalous, which is why it would require
    somebody to go to the trouble to make a special ligation table
    entry for it.

    But what longer-term problems are you talking about? I didn't
    say we should put in a formal rendering *rule* in the Unicode
    Standard that says something different from Figure 7-6, along
    the lines of converting one form to the other as above.

    Look, let's consider again what problem we are trying to solve
    here. We have two funky forms from the ALA-LC transliteration
    tables, for which we haven't heard back yet from bibliographic
    sources whether there is any *actual* data representation
    problem in USMARC records.

    We can try to invent and promulgate a generic rendering solution
    for these cases (and anything like them) in the Unicode Standard,
    despite the fact that they are an edge case of an edge case for
    Latin script rendering... Or, if it turns out that it isn't a
    general-enough problem to force everyone to deal with it in terms
    of generic rendering, we could suggest alternatives:

       a. markup solutions
       b. specific font ligation solutions for specialized data

    Now consider again the function of these things in the ALA-LC
    transliteration. The Cyrillic transliteration recommendations
    make rather extensive use of ligature ties. Why? Because the
    ALA-LC transliteration schemes make some effort to be round-trippable.
    In other words, the Cyrillic transliteration they recommend is
    not merely a useful romanization that might be in more general
    use, as for a newspaper, but is a romanization from which, in
    principle, you ought to be able to recover the Cyrillic it
    was transliterated from. Thus these schemes distinguish t-s
    from t-s-tie-ligature, since the ligated form might be a
    transliteration of a tse or similar letter, whereas the t-s
    would be a transliteration of a te+es, and so on. In other
    words, the tie-ligatures are being sprinkled in to make ad hoc
    digraphs for the transliteration, to aid in recovery of the
    Cyrillic from the romanization.
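
    A minimal sketch of that round-trip property, in Python. The
    three-letter table is a toy assumption made up for illustration,
    not the ALA-LC tables themselves, and the function names are
    hypothetical:

```python
# Toy table (an assumption for illustration, not the full ALA-LC scheme).
# U+0361 is the double inverted breve used as the ligature tie.
ROMAN = {"т": "t", "с": "s", "ц": "t\u0361s"}
CYRILLIC = {roman: cyr for cyr, roman in ROMAN.items()}


def romanize(text):
    """Map each Cyrillic letter to its romanization."""
    return "".join(ROMAN[ch] for ch in text)


def deromanize(text):
    """Recover the Cyrillic, always trying the longest romanization first."""
    keys = sorted(CYRILLIC, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(CYRILLIC[key])
                i += len(key)
                break
        else:
            raise ValueError(f"no mapping at offset {i}")
    return "".join(out)
```

    Without the tie, both te+es and tse would come out as plain "ts"
    and the reverse mapping would have to guess.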

    Now the dots above typically represent an articulatory diacritic,
    as for palatalization, or the like.

    So the combination of the two is to indicate: we are transliterating
    a letter with a palatal (say) diacritic, using a digraph.

    Do we have alternatives in Unicode for that? Well, yes, depending
    on whether the problem is:

      a. enabling exact transcoding of the USMARC data records
         using ALA-LC romanization recommendations and the ANSEL
         character set, for interoperability with Unicode systems.

    or

      b. typesetting the ALA-LC romanization guide in
         Unicode, treating all the data therein as plain text and
         using generic Unicode rendering rules.

    I contend that the primary problem is a), and that we ought
    to examine the general usefulness of this dot-above-double-diacritic
    and related rendering, before we insist it has to be representable
    in plain text and go looking for an encoding solution and specify a
    bunch of rendering rules for it.

    If the essential requirement here is to capture the data
    functionality of the transliteration: a roundtrippable form,
    with a palatal diacritic, using a digraph, we could suggest,
    for instance:

    <U+0074, U+034F, U+0073, U+0307>

    or

    <U+0074, U+0307, U+034F, U+0073>

    where we end up with an explicitly indicated digraph, with a
    dot-above diacritic (pick which letter you want it on), as
    a grapheme cluster. This is distinct from:

    <U+0074, U+0073, U+0307>

    or

    <U+0074, U+0307, U+0073>

    so you have your transliteration round-trippability intact.
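
    A quick way to check that claim, sketched with Python's standard
    unicodedata module: the four sequences above should stay distinct
    from one another under each normalization form. The variable and
    function names are just for illustration:

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER
DOT = "\u0307"  # COMBINING DOT ABOVE

# The four sequences from the message, in the order given.
candidates = [
    "t" + CGJ + "s" + DOT,  # digraph, dot on the s
    "t" + DOT + CGJ + "s",  # digraph, dot on the t
    "t" + "s" + DOT,        # plain t + s, dot on the s
    "t" + DOT + "s",        # plain t + s, dot on the t
]


def distinct_under(form):
    """True if no two candidates collapse together under this form."""
    normalized = {unicodedata.normalize(form, s) for s in candidates}
    return len(normalized) == len(candidates)


def all_forms_distinct():
    return all(distinct_under(f) for f in ("NFC", "NFD", "NFKC", "NFKD"))
```

    The joiner has combining class 0 and no decomposition, so
    normalization leaves it in place in the data.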

    And for your special-purpose application, which is a Unicode system
    to display USMARC bibliographic records using the ALA-LC romanization
    presentation conventions, you add ligation entries to your font
    so that

    <U+0074, U+034F, U+0073, U+0307>

    and similar forms using a U+034F COMBINING GRAPHEME JOINER display
    with a visible tie-ligature, rather than nothing, even though
    no U+0361 double diacritic is being used in the data. Problem
    solved.
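
    Roughly what such a ligation entry amounts to, as a toy Python
    sketch. This is not how any real font format or shaping engine is
    structured; the table, the glyph name, and the shape() function
    are all hypothetical:

```python
CGJ = "\u034f"  # U+034F, normally displays as nothing

# Hypothetical ligation table for the special-purpose font:
# code point sequence -> glyph name.
LIGATURES = {
    ("t", CGJ, "s", "\u0307"): "t-s-tie-ligature-with-dot-above",
}


def shape(text, ligatures=LIGATURES):
    """Return glyph names, using a ligature entry where one matches."""
    glyphs = []
    i = 0
    while i < len(text):
        for seq, glyph in ligatures.items():
            if tuple(text[i:i + len(seq)]) == seq:
                glyphs.append(glyph)
                i += len(seq)
                break
        else:
            # No entry matched: one glyph per character, and the
            # joiner contributes nothing visible.
            if text[i] != CGJ:
                glyphs.append(f"U+{ord(text[i]):04X}")
            i += 1
    return glyphs
```

    Calling shape() with an empty table stands in for an off-the-shelf
    font: the same data comes out as plain t, s, and dot above, with
    no tie.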

    Of course, that doesn't mean that your converted USMARC data
    records involving digraphs for Cyrillic transliteration will
    display with the tie-ligature in a generic web application using
    off-the-shelf fonts -- but is that the problem we are trying
    to solve here? I doubt it. The forms would be legible -- perhaps
    more legible without the obtrusive ties cluttering them up --
    and the data distinctions would still be preserved in such
    contexts.

    --Ken


