Re: missing characters: combining marks above runs of more than 2 base letters

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 19 Nov 2011 03:44:36 +0100

2011/11/19 Ken Whistler <kenw_at_sybase.com>:
> On 11/18/2011 5:24 PM, Philippe Verdy wrote:
>>
>> This arc in the example is definitely NOT mathematics
>
> Nor did I say it was.
>
>> (even if you have read a version of this page that attempted to
>> represent it using TeX math notation, an obvious error because it
>> used the angular \widehat and not the appropriate sign).
>
> Irrelevant.
>
>>  This arc is a true
>> phonetic mark of a contextual elision (the intermediate letter(s)
>> are not to be pronounced, even though they are still written, to
>> make the phonetically elided word(s) explicit and to keep their
>> usual orthography).
>
> The fact that the function of the mark is to indicate a contextual elision
> is
> also essentially irrelevant to the analysis of whether such marking consists
> of a mark (character) in text or a mark-up (non-character) of text.
>
> The issue to pay attention to is whether the scoping of the modification of
> text is cleanly delimited to a single character at a time, or is in
> principle
> extensible across n characters.

Unicode encodes many things whose scope of modification applies to
more than one character. To begin with, it already defines grapheme
clusters...
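
For example (a minimal sketch of mine in Python, using the
third-party "regex" module, whose \X pattern matches one extended
grapheme cluster; not from the standard itself):

    import regex   # pip install regex; \X matches an extended grapheme cluster

    s = "e\u0301o\u0302"              # e + COMBINING ACUTE, o + COMBINING CIRCUMFLEX
    print(regex.findall(r"\X", s))    # ['é', 'ô'] -- two clusters, four code points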

>>
>> Exactly similar to other phonetic symbols like the elision tie
>> (an arc joining two words to elide the separating space), or the
>> apostrophe (which completely replaces the elided letters).
>>
>> And obviously a true candidate for plain text: it simultaneously
>> provides two readings of the text, one purely phonetic (and
>> accurate for poems that have an essential and very strong rhythmic
>> structure), the other semantic (through the preserved orthography).
>> All letters have to be present in some way, even if some of them
>> are marked for the expected phonetics.
>
> And is obviously *not* a true candidate for plain text representation. This
> kind
> of markup for simultaneous alternative readings of text is precisely where
> representation by a richer mechanism makes sense.

And this is contradicted within the Hebrew, Arabic and Tibetan
scripts, where simultaneous alternative readings are marked by
combining signs, including for cantillation and songs.
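
For example (a small sketch of mine in Python, using only the
standard unicodedata module), a Hebrew cantillation mark is an
ordinary combining character in plain text, not markup:

    import unicodedata

    mark = "\u0591"                        # a cantillation mark
    print(unicodedata.name(mark))          # HEBREW ACCENT ETNAHTA
    print(unicodedata.combining(mark))     # 220, a nonzero combining class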

> And this is merely the
> veriest toe in the water for what I am referring to as "text scoring".
>
> For an example of the complexity of various approaches to these kinds of
> problems,
> see:
>
> http://www.ilc.cnr.it/EAGLES/spokentx/node31.html
>
> And here is an example of a well worked-out, systematic, multi-level scoring
> system
> for prosodic information, the ToBI annotation conventions:
>
> http://www.cs.columbia.edu/~agus/tobi/labelling_guide_v3.pdf

Thanks for these. But what I wanted to find is what has already been
encoded with the macron half marks (U+FE24 and U+FE25), which can be
elongated by U+FE26.
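
For the record (a small sketch of mine in Python, standard
unicodedata module), here is a macron elongated over three base
letters with the encoded half marks and the conjoining macron:

    import unicodedata

    s = "a\uFE24b\uFE26c\uFE25"
    for ch in s:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0061  LATIN SMALL LETTER A
    # U+FE24  COMBINING MACRON LEFT HALF
    # U+0062  LATIN SMALL LETTER B
    # U+FE26  COMBINING CONJOINING MACRON
    # U+0063  LATIN SMALL LETTER C
    # U+FE25  COMBINING MACRON RIGHT HALF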

Unicode also includes characters needed only for presentation
purposes and nothing else, such as the Arabic kashida (U+0640 ARABIC
TATWEEL, whose glyph, as assigned in fonts, is used internally by
text renderers to position the joining line, juxtaposing one or more
of them between letters in order to justify text, even though this
character is not really encoded in standard texts, only in
monospaced texts!)...
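
For instance (a sketch of mine in Python; the code points are real,
the justification logic is only illustrative), stretching a word by
inserting U+0640 between two joining letters:

    import unicodedata

    word = "\u0643\u062A\u0628"            # kaf, teh, beh
    stretched = word[:2] + "\u0640" * 3 + word[2:]
    print(unicodedata.name("\u0640"))      # ARABIC TATWEEL
    print(stretched)                        # the same word with a kashida run inside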

Do you argue here that U+FE26 is not plain text? Can't you see that
<x, U+FE24, y, U+FE25> also has an alternate encoding with a
"double"-width macron (U+035E) encoded between x and y, and that the
double diacritics would not have been needed at all if we had
already adopted the standard convention of using half marks to mark
the beginning and end of an elongated diacritic applying to a run of
more than one grapheme cluster (assuming that the half marks are not
reorderable under normalization, i.e. that they have combining class
zero)? And that, finally, all these macron variants are exactly the
same sign, just applied to a different number of grapheme clusters
(either one, or two with the "double" marks, or more when using half
marks)?

What is plain text, then? Well, only what the UTC and WG2 agree to
encode, and only because there is already a consistent or historical
use. As of today, those elongated marks already have an established,
historic and consistent use. That's why I advocate a change of
paradigm to avoid these multiplications:

In a past message a few days ago, I wanted to propose format
controls to mark the beginning and end of "extended clusters". That
would solve everything very simply, reusing the same encoded
diacritics without defining any more "double" or "half" marks... And
even if those "extended clusters" cannot have their layout shown
exactly (due to limitations in renderers), these controls would be
representable by visible glyphs of their own (for example, right or
left dotted half circles in dotted squares), to which we could still
apply and represent the standard diacritics (without needing to
reposition and resize them in the complex way that only more
advanced renderers support), using all existing font technologies
(pending later improvements in renderers and fonts to support better
layouts without using pseudo-glyphs).
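
To illustrate the idea (a purely hypothetical sketch of mine in
Python; no such controls exist, so I use the private-use code points
U+E000 and U+E001 as stand-ins for the proposed begin/end controls):

    import unicodedata

    BEGIN, END = "\uE000", "\uE001"   # hypothetical stand-ins, not real controls

    def extended_clusters(text):
        # Group a BEGIN...END run plus any combining marks that follow it,
        # so the marks are understood to apply to the whole run.
        out, i = [], 0
        while i < len(text):
            if text[i] == BEGIN:
                j = text.index(END, i + 1)            # assume well-formed pairing
                k = j + 1
                while k < len(text) and unicodedata.combining(text[k]):
                    k += 1                            # marks scoped to the run
                out.append((text[i + 1:j], text[j + 1:k]))
                i = k
            else:
                out.append((text[i], ""))
                i += 1
        return out

    # A macron scoped over the three-letter run "abc":
    print(extended_clusters(BEGIN + "abc" + END + "\u0304xyz"))
    # [('abc', '\u0304'), ('x', ''), ('y', ''), ('z', '')]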

-- Philippe.