Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

Date: Sat Sep 21 2002 - 11:15:32 EDT


    On 09/20/2002 07:07:30 PM Kenneth Whistler wrote:

    >Well, yes, it would be anomalous, which is why it would require
    >somebody to go to the trouble to make a special ligation table
    >entry for it.
    >But what longer-term problems are you talking about?

    I'm saying that *if* there is a need for digital data representation of
    the things in the ALA-LC transliteration (which, like you, I consider not
    to have yet been demonstrated), then I wouldn't want to suggest it can be
    represented as the sequence

    > ><U+0074, U+0361, U+0073, U+0307>

    since that has an existing, distinct presentation specified by the
    Standard, viz.

    > >{t-s-dot-tie-ligature} glyph

    and it could create problems to have two distinct text forms having the
    same encoded representation.
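
    To make the encoded sequence concrete, here is a minimal sketch in Python
    that lists each code point with its character name and canonical combining
    class, assuming only the standard unicodedata module. The helper name
    describe is made up for illustration. It shows only what is encoded, not
    how any font or renderer will draw it.

```python
import unicodedata

# The sequence under discussion: t, tie (double inverted breve), s, dot above.
seq = "\u0074\u0361\u0073\u0307"

def describe(text):
    """Return (code point, character name, canonical combining class) per character.

    'describe' is a hypothetical helper name used only for this illustration.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

for row in describe(seq):
    print(row)
```

    A combining class of 0 marks the base letters; the two marks carry
    non-zero classes, and U+0307 directly follows the s in the encoded order.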

    >I didn't
    >say we should put in a formal rendering *rule* in the Unicode
    >Standard that says something different from Figure 7-6, along
    >the lines of converting one form to the other as above.

    Well, it isn't clear to me what you *are* intending to convey. You said

    > >This stuff *can* all be handled with appropriately designed
    > >ligations in fonts, so there are options for display:
    > >
    > ><U+0074, U+0361, U+0073, U+0307>
    > >
    > > ==>
    > > maps via ligation table to:
    > >
    > >{t-s-tie-ligature-with-dot-above} glyph

    which sounded to me like suggesting an alternate rule for rendering that
    encoded sequence. If you're merely suggesting someone *could* create a
    custom rendering not specifically sanctioned by the Standard, I still
    wouldn't be comfortable with the suggestion as expressed (especially by an
    officer of the Consortium) as it could lead to some user body implementing
    that on a widespread basis and using that encoded representation in
    interchange under the assumption that the Standard permitted it ("it didn't
    seem to explicitly disallow it"). But that could lead to conflict with
    others' implementations that assume the rendering which *is* explicitly
    sanctioned. It needs to be understood by all that such a rendering rule is
    non-standard, and that it should be assumed that others will not interpret
    that encoded sequence in that way.
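
    The interchange concern can be sketched as a toy model in Python. The two
    dictionaries below stand in for ligation tables; they are not a real font
    format or API, and the glyph names are just the placeholders used in this
    thread. The table names and the render function are hypothetical.

```python
# Toy model only: a "ligation table" is a dict from a code point sequence
# to a glyph name. Real font tables are considerably more involved.
SEQ = ("\u0074", "\u0361", "\u0073", "\u0307")

# Rendering assumed by implementations that follow the Standard's figure.
standard_table = {SEQ: "{t-s-dot-tie-ligature}"}

# Rendering produced by a custom, non-standard font entry.
custom_table = {SEQ: "{t-s-tie-ligature-with-dot-above}"}

def render(table, chars):
    """Look up a glyph name for the sequence, or report that no entry exists."""
    return table.get(tuple(chars), "<no ligature entry>")

# The same encoded data yields different text forms on the two systems.
print(render(standard_table, SEQ))
print(render(custom_table, SEQ))
```

    The encoded sequence is identical on both sides, so nothing in the data
    tells the receiving process which of the two forms the sender meant.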

    >Look, let's consider again what problem we are trying to solve
    >here. We have two funky forms from the ALA-LC transliteration
    >tables, for which we haven't heard back yet from bibliographic
    >sources whether there actually is any *actual* data representation
    >problem in USMARC records.


    >We can try to invent and promulgate a generic rendering solution
    >for these cases (and anything like them) in the Unicode Standard,
    >despite the fact that they are an edge case of an edge case for
    >Latin script rendering... Or, if it turns out that it isn't a
    >general-enough problem to force everyone to deal with it in terms
    >of generic rendering, we could suggest alternatives:
    > a. markup solutions
    > b. specific font ligation solutions for specialized data

    If it really is an edge case of an edge case, and the only need to present
    it is in situations such as "At one time, someone even suggested a
    transliterated representation of this as shown here...", then I'd opt for a,
    or even a graphic. Of course, someone might even do b. But my only concern
    is that anyone who does such a font implementation should understand that
    they are creating a customised, non-standard rendering implementation, and
    that they shouldn't expect the encoded sequence < 0074, 0361, 0073, 0307 >
    would be understood by *any other process* to mean that text element.

    >Now consider again the function of these things in the ALA-LC
    >transliteration. The Cyrillic transliteration recommendations
    >make rather extensive use of ligature ties. Why? Because the
    >ALA-LC transliteration schemes make some effort to be round-trippable.
    >In other words, the Cyrillic transliteration they recommend is
    >not merely a useful romanization that might be in more general
    >use, as for a newspaper, but is a romanization from which, in
    >principle, you ought to be able to recover the Cyrillic it
    >was transliterated from.

    Yes. They (well, at least, TC 46) even go to pains to formally define
    "transliteration" to mean only things that are round-trippable. It's fine
    to make explicit what they mean in their standards, but we also know that
    there are many systems that are commonly known as "transliterations" (and
    some of them, de facto standards) that are not round-trippable.


    >Do we have alternatives in Unicode for that? Well, yes, depending
    >on whether the problem is:
    > a. enabling exact transcoding of the USMARC data records
    > using ALA-LC romanization recommendations and the ANSEL
    > character set, for interoperability with Unicode systems.
    > b. typesetting the ALA-LC romanization document guide in
    > Unicode, treating all the data therein as plain text and
    > using generic Unicode rendering rules.
    >I contend that the primary problem is a), and that we ought
    >to examine the general usefulness of this dot-above-double-diacritic
    >and related rendering, before we insist it has to be representable
    >in plain text and go looking for an encoding solution and specify a
    >bunch of rendering rules for it.

    I agree.

    >If the essential requirement here is to capture the data
    >functionality of the transliteration: a roundtrippable form,
    >with a palatal diacritic, using a digraph, we could suggest,
    >for instance:
    ><U+0074, U+034F, U+0073, U+0307>
    ><U+0074, U+0307, U+034F, U+0073>
    >where we end up with an explicitly indicated digraph,

    Yes, in encoded representation, though it would have a distinct appearance
    in rendering (but such a need hasn't been assumed).
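
    As a rough check that these really are distinct encoded representations,
    the following Python sketch normalizes the tie sequence and the two
    U+034F sequences and compares the results. It assumes only the standard
    unicodedata module; the variable and function names are made up for
    illustration.

```python
import unicodedata

tie   = "\u0074\u0361\u0073\u0307"  # t, tie, s, dot above
cgj_a = "\u0074\u034F\u0073\u0307"  # t, U+034F, s, dot above
cgj_b = "\u0074\u0307\u034F\u0073"  # t, dot above, U+034F, s

def nfc_codepoints(text):
    """Return the code points of the NFC-normalized text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]

for label, text in (("tie", tie), ("cgj_a", cgj_a), ("cgj_b", cgj_b)):
    print(label, nfc_codepoints(text))

# No two of the three sequences normalize to the same string,
# so none of them is canonically equivalent to another.
forms = {unicodedata.normalize("NFD", s) for s in (tie, cgj_a, cgj_b)}
print(len(forms))
```

    Under NFC the letter plus dot above composes into a single precomposed
    character in each case, while U+0361 and U+034F stay in place, so a
    process comparing normalized text will still see three different strings.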

    >And for your special-purpose application, which is a Unicode system
    >to display USMARC bibliographic records using the ALA-LC romanization
    >presentation conventions, you add ligation entries to your font
    >so that
    ><U+0074, U+034F, U+0073, U+0307>
    >and similar forms using a U+034F GRAPHEME JOINER display with a
    >visible tie-ligature, rather than nothing, despite the fact that
    >no U+0361 double diacritic is being used in the data. Problem

    Again, so long as you don't assume you can interchange these sequences and
    have them interpreted in the same way (in the absence of a higher-level
    protocol assumed by both parties by prior agreement).

    >Of course, that doesn't mean that your converted USMARC data
    >records involving digraphs for Cyrillic transliteration will
    >display with the tie-ligature in a generic web application using
    >off-the-shelf fonts -- but is that the problem we are trying
    >to solve here? I doubt it.

    Agreed. My original objection was exactly because this bit wasn't stated.

    And I'd go a step further: if we *did* decide one day that that problem
    does need to be solved, then either some distinct encoding mechanism or some
    standardised solution in markup will be needed -- but I'm not assuming we
    will actually come to that point.

    - Peter

    Peter Constable

    Non-Roman Script Initiative, SIL International
    7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
    Tel: +1 972 708 7485
