Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)

From: Peter_Constable@sil.org
Date: Sat Sep 21 2002 - 11:15:32 EDT

Next message: PRANI6@Bertelsmann.de: "entities with breve"

Previous message: Kenneth Whistler: "Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)"
Maybe in reply to: William Overington: "Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)"

On 09/20/2002 07:07:30 PM Kenneth Whistler wrote:

>Well, yes, it would be anomalous, which is why it would require
>somebody to go to the trouble to make a special ligation table
>entry for it.
>
>But what longer-term problems are you talking about?

I'm saying that *if* there is a need for digital data representation of
the things in the ALA-LC transliteration (which, like you, I consider not
to have yet been demonstrated), then I wouldn't want to suggest it can be
represented as the sequence

> ><U+0074, U+0361, U+0073, U+0307>

since that has an existing, distinct presentation specified by the
Standard, viz.

> >{t-s-dot-tie-ligature} glyph

and it could create problems to have two distinct text forms having the
same encoded representation.
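
A minimal Python sketch, assuming only the standard-library unicodedata
module (the helper name describe is purely illustrative), lists what that
sequence actually contains. A renderer sees one code point sequence
regardless of which of the two glyph forms a font designer had in mind:

```python
import unicodedata

def describe(seq):
    """Return (code point, character name, canonical combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in seq
    ]

# The sequence under discussion: t, double inverted breve (tie), s, dot above.
tie_seq = "\u0074\u0361\u0073\u0307"

for row in describe(tie_seq):
    print(row)
```

Nothing in the encoded data distinguishes a dot placed on the "s" under the
tie from a dot placed above the tie as a whole.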

>I didn't
>say we should put in a formal rendering *rule* in the Unicode
>Standard that says something different from Figure 7-6, along
>the lines of converting one form to the other as above.

Well, it isn't clear to me what you *are* intending to convey. You said

> >This stuff *can* all be handled with appropriately designed
> >ligations in fonts, so there are options for display:
> >
> ><U+0074, U+0361, U+0073, U+0307>
> >
> > ==>
> > maps via ligation table to:
> >
> >{t-s-tie-ligature-with-dot-above} glyph

which sounded to me like suggesting an alternate rule for rendering that
encoded sequence. If you're merely suggesting someone *could* create a
custom rendering not specifically sanctioned by the Standard, I still
wouldn't be comfortable with the suggestion as expressed (especially by an
officer of the Consortium) as it could lead to some user body implementing
that on a widespread basis and using that encoded representation in
interchange under the assumption that the Standard permitted it ("it didn't
seem to explicitly disallow it"). But that could lead to conflict with
others' implementations that assume the rendering which *is* explicitly
sanctioned. It needs to be understood by all that such a rendering rule is
non-standard, and that it should be assumed that others will not interpret
that encoded sequence in that way.

>Look, let's consider again what problem we are trying to solve
>here. We have two funky forms from the ALA-LC transliteration
>tables, for which we haven't heard back yet from bibliographic
>sources whether there actually is any *actual* data representation
>problem in USMARC records.

Agreed.

>We can try to invent and promulgate a generic rendering solution
>for these cases (and anything like them) in the Unicode Standard,
>despite the fact that they are an edge case of an edge case for
>Latin script rendering... Or, if it turns out that it isn't a
>general-enough problem to force everyone to deal with it in terms
>of generic rendering, we could suggest alternatives:
>
> a. markup solutions
> b. specific font ligation solutions for specialized data

If it really is an edge case of an edge case, and the only need to present
it were in situations such as, "At one time, someone even suggested a
transliterated representation of this as shown here...", then I'd opt for a,
or even a graphic. Of course, someone might even do b. But my only concern
is that anyone who does such a font implementation should understand that
they are creating a customised, non-standard rendering implementation, and
that they shouldn't expect the encoded sequence < 0074, 0361, 0073, 0307 >
to be understood by *any other process* to mean that text element.

>Now consider again the function of these things in the ALA-LC
>transliteration. The Cyrillic transliteration recommendations
>make rather extensive use of ligature ties. Why? Because the
>ALA-LC transliteration schemes make some effort to be round-trippable.
>In other words, the Cyrillic transliteration they recommend is
>not merely a useful romanization that might be in more general
>use, as for a newspaper, but is a romanization from which, in
>principle, you ought to be able to recover the Cyrillic it
>was transliterated from.

Yes. They (well, at least, TC 46) even go to pains to formally define
"transliteration" to mean only things that are round-trippable. It's fine
to make explicit what they mean in their standards, but we also know that
there are many systems that are commonly known as "transliterations" (and
some of them, de facto standards) that are not round-trippable.

[snip]

>Do we have alternatives in Unicode for that? Well, yes, depending
>on whether the problem is:
>
> a. enabling exact transcoding of the USMARC data records
> using ALA-LC romanization recommendations and the ANSEL
> character set, for interoperability with Unicode systems.
>
>or
>
> b. typesetting the ALA-LC romanization document guide in
> Unicode, treating all the data therein as plain text and
> using generic Unicode rendering rules.
>
>I contend that the primary problem is a), and that we ought
>to examine the general usefulness of this dot-above-double-diacritic
>and related rendering, before we insist it has to be representable
>in plain text and go looking for an encoding solution and specify a
>bunch of rendering rules for it.

I agree.

>If the essential requirement here is to capture the data
>functionality of the transliteration: a roundtrippable form,
>with a palatal diacritic, using a digraph, we could suggest,
>for instance:
>
><U+0074, U+034F, U+0073, U+0307>
>
>or
>
><U+0074, U+0307, U+034F, U+0073>
>
>where we end up with an explicitly indicated digraph,

Yes, in encoded representation, though it would have a distinct appearance
in rendering (but such a need hasn't been assumed).

>And for your special-purpose application, which is a Unicode system
>to display USMARC bibliographic records using the ALA-LC romanization
>presentation conventions, you add ligation entries to your font
>so that
>
><U+0074, U+034F, U+0073, U+0307>
>
>and similar forms using a U+034F GRAPHEME JOINER display with a
>visible tie-ligature, rather than nothing, despite the fact that
>no U+0361 double diacritic is being used in the data. Problem
>solved.

Again, so long as you don't assume you can interchange these sequences and
have them interpreted in the same way (in the absence of a higher-level
protocol assumed by both parties by prior agreement).
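
Normalization does not bridge that gap. A minimal Python sketch, assuming
only the standard-library unicodedata module (the labels and the nfd helper
are illustrative names, not taken from any standard), checks that the
U+034F sequences and the U+0361 sequence stay distinct under canonical
normalization, so a receiving process has no standard basis for treating
one as the other:

```python
import unicodedata

# The three candidate spellings discussed in this thread.
candidates = {
    "double-diacritic tie": "\u0074\u0361\u0073\u0307",
    "CGJ, dot on s": "\u0074\u034F\u0073\u0307",
    "CGJ, dot on t": "\u0074\u0307\u034F\u0073",
}

def nfd(text):
    """Canonical decomposition, used here as the equivalence test."""
    return unicodedata.normalize("NFD", text)

normalized = {label: nfd(text) for label, text in candidates.items()}

# Canonically equivalent strings share one NFD form; these three do not.
all_distinct = len(set(normalized.values())) == len(candidates)

for label, text in normalized.items():
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in text))
print("all canonically distinct:", all_distinct)
```

Any mapping between these spellings therefore has to come from the private
agreement itself, not from a conformant Unicode process.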

>Of course, that doesn't mean that your converted USMARC data
>records involving digraphs for Cyrillic transliteration will
>display with the tie-ligature in a generic web application using
>off-the-shelf fonts -- but is that the problem we are trying
>to solve here? I doubt it.

Agreed. My original objection was exactly because this bit wasn't stated.

And I'd go a step further: if we *did* decide one day that that problem
does need to be solved, then either some distinct encoding mechanism or some
standardised solution in markup will be needed -- but I'm not assuming we
will actually come to that point.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

This archive was generated by hypermail 2.1.5 : Sat Sep 21 2002 - 13:32:20 EDT