Re: combining: half, double, triple et cetera ad infinitum from Philippe Verdy on 2012-02-08 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 8 Feb 2012 12:30:02 +0100

2012/2/6 QSJN 4 UKR <qsjn4ukr_at_gmail.com>:
>> 2011/11/14 Philippe Verdy <verdy_p_at_wanadoo.fr>:
>>> And arguably, I have also wanted this since long, instead of the hacks
>>> introduced by the so called "double" diacritics and "half" diacritics
>>> that break the character identity of those diacritics and also
>>> introduce encoding ambiguities.
>>>
>>> In fact, those things would have been encoded since long if Unicode
>>> and ISO 10646 had extended their character model to cover a broader
>>> range of "structured character clusters".
>>>
>>> Two format characters (with combining class 0 for the purpose of
>>> normalizations) would have been enough for most applications:
>>> - U+xxx0 BEGIN EXTENDED CLUSTER (BEC)
>>> - U+xxx1 END EXTENDED CLUSTER (EEC)
>>> And then you would have encoded the standard diacritics after the
>>> sequence enclosed by these characters, for example cartouches (using
>>> an enclosing diacritic).
>>>
>>> A third format control would have been used as well to specify that
>>> two clusters (simple letters or letters with simple diacritics, and
>>> including extended clusters) would stack vertically instead of
>>> horizontally. With this third one, the basic structure would be
>>> encodable really as plain-text.
>>>
>>> Yes, this would not have worked with today's OpenType specifications,
>>> but this would have been the place for extending those specifications
>>> and not something blocking the encoding process. I am still convinced
>>> that this should not be part of an "upper-layer standard", which is
>>> not interoperable, and complicates the integration of those
>>> pseudo-encoded texts.
>>>
>>> Once the structure is encoded as such, there is still the possibility
>>> to create a linear graphical representation as a reasonably readable
>>> fallback exhibiting the structure unambiguously, even if the text
>>> renderer cannot produce the 2D layout (you just need to make those
>>> format controls visible by themselves with a glyph, or by some other
>>> means offered in the text renderer, including colors or various
>>> style effects).
>
> We don't need new special characters nor new half-characters nor new
> ccc as I proposed above. No!
> We already have the Annotation Characters!
> It is possible to use something like U+FFF9 ANNOTATION ANCHOR РКГ
> U+FFFA ANNOTATION SEPARATOR U+0483 COMBINING TITLO U+FFFB ANNOTATION
> TERMINATOR for the Cyrillic number 123 (РКГ under titlo). This way,
> titlos with supralinear letters (like SLOVO TITLO, TVERDO TITLO, see
> http://ru.wikipedia.org/wiki/%d0%a2%d0%b8%d1%82%d0%bb%d0�) are implementable.
> The only question is the right processing of annotation chunks that
> start with a nonstarter. I mean that a chunk of a multiline annotation
> that begins with a combining character, without a base character,
> should use the previous chunk as its base (in the best application).

On the surface this may look like a good idea. But...

Those annotation characters mark some external text that is only
loosely linked to the base text it annotates.

They could be read as if the content of the annotation were an
alternative version of the base text, i.e. a possible replacement, so
the annotation does not need to be a full sentence. Since the
annotation is still a word or a few words with some meaning of their
own, a renderer could place it in a side note or in the page footer,
or (in interactive displays) show it in a floating window when
hovering over the base text, or (printed on paper) under a small flap
that you can lift to discover the text behind (when you lift it, the
normal base text printed on the flap is hidden: the annotation is then
a replacement).

Such use is also similar to ruby notations in East-Asian texts, except
that the content of the ruby text only conveys a small part of the
semantics, and is used to disambiguate the base text or to help
pronounce it.

In all these examples, the content of the annotation has a meaning by
itself. This is not the case if the annotation is just a diacritic
spanning a base text.
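
To make the sequence under discussion concrete, here is a minimal
Python sketch that builds it and inspects the character properties. It
only uses the standard `unicodedata` module; the helper names
`annotate` and `annotation_is_bare_mark` are made up for this
illustration. Note that the formal names of U+FFF9..U+FFFB are
INTERLINEAR ANNOTATION ANCHOR / SEPARATOR / TERMINATOR, and that of
U+0483 is COMBINING CYRILLIC TITLO.

```python
import unicodedata

# Interlinear annotation controls (existing code points U+FFF9..U+FFFB).
ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"
TITLO = "\u0483"  # COMBINING CYRILLIC TITLO, a nonstarter (ccc=230)


def annotate(base: str, annotation: str) -> str:
    """Wrap a base text and its annotation in the interlinear controls."""
    return f"{ANCHOR}{base}{SEPARATOR}{annotation}{TERMINATOR}"


def annotation_is_bare_mark(text: str) -> bool:
    """Return True if the annotation chunk holds only combining marks.

    Hypothetical helper: it assumes exactly one separator and one
    terminator, i.e. a single-chunk annotation.
    """
    start = text.index(SEPARATOR) + 1
    end = text.index(TERMINATOR)
    chunk = text[start:end]
    return bool(chunk) and all(
        unicodedata.category(c).startswith("M") for c in chunk
    )


encoded = annotate("РКГ", TITLO)

# Print code point, general category and canonical combining class.
for ch in encoded:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} "
          f"ccc={unicodedata.combining(ch)}")
```

With the titlo example, the annotation chunk contains nothing but a
combining mark, whereas a ruby-like annotation such as
`annotate("漢字", "かんじ")` contains text that can stand on its own.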

So the titlo in your example only carries the poor semantics of the
titlo itself. It cannot be used as a replacement text, and makes no
sense by itself. Instead it is used to alter the meaning and reading
of the base text (changing letters into digits, distinguishing sacred
words like "God" from common, non-sacred "gods", or marking the
difference between a fully spelled word and an abbreviation). There is
no place here for the titlo to be rendered separately (for example in
a side note in the margins of the text, in page footers, or in a line
inserted between lines of normal text and rendered distinctly, for
example with a smaller font).

The same is true of other combining signs used for epigraphic
notations: you cannot place them anywhere other than within the base
text itself. Their semantics is not that of a replacement, but of
optional additional content which is strongly tied to the base text.
Any attempt to place them elsewhere would make no sense (for example
if a renderer surrounded the base text with a dotted bounding box
followed by a mark linking to a side note or footnote, and displayed
the side note or footnote separately, with the same leading mark
followed by the content of the annotation, here the combining mark).

Annotations that are just combining marks still need something better,
and I strongly feel that the mechanisms used for interlinear
annotations or for ruby texts are not suitable here.

My proposal for marking what constitutes an "extended cluster" (with
a possible fallback mechanism for renderers, which can still show some
alternate glyphs for the two begin/end marks, such as a small
parenthesis within a dotted box) is probably more suitable here: it
still allows the combining mark to be encoded immediately after the
"extended cluster" end mark (and still rendered correctly by the
fallback mechanism, on top of the fallback glyph used to render the
end mark).

In other words, the end mark is designated as a suitable base
character after which standard combining marks can be encoded (instead
of using "double" combining marks and other "half" combining marks,
which would remain, for compatibility only, for the simplest cases
where they currently work, even though their separate encoding was
clearly a disunification that was poorly justified).
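
As a rough illustration only, the following Python sketch shows how
such an encoding could be processed. Nothing in it is standardized:
the BEC/EEC controls do not exist, so the private-use code points
U+E000 and U+E001 stand in for them here (an assumption made for this
sketch), and the function names and the choice of plain parentheses as
fallback glyphs are hypothetical as well.

```python
import unicodedata

# HYPOTHETICAL stand-ins for the proposed format controls. They are
# private-use code points, chosen only because no real ones exist.
BEC = "\uE000"  # BEGIN EXTENDED CLUSTER (assumed)
EEC = "\uE001"  # END EXTENDED CLUSTER (assumed)


def extended_cluster(content: str, marks: str) -> str:
    """Encode content as an extended cluster followed by its marks."""
    return f"{BEC}{content}{EEC}{marks}"


def parse_clusters(text: str) -> list[tuple[str, str]]:
    """Return (content, marks) for each top-level extended cluster.

    Nested clusters are kept verbatim inside the content of the outer
    one. The marks are the combining characters (general category M*)
    that immediately follow the end mark.
    """
    clusters = []
    i = 0
    while i < len(text):
        if text[i] != BEC:
            i += 1
            continue
        depth, j = 1, i + 1
        while j < len(text) and depth:
            if text[j] == BEC:
                depth += 1
            elif text[j] == EEC:
                depth -= 1
            j += 1
        if depth:
            raise ValueError("unterminated extended cluster")
        content = text[i + 1 : j - 1]
        k = j
        while k < len(text) and unicodedata.category(text[k]).startswith("M"):
            k += 1
        clusters.append((content, text[j:k]))
        i = k
    return clusters


def linear_fallback(text: str) -> str:
    """Make the controls visible; marks then sit on the end glyph."""
    return text.replace(BEC, "(").replace(EEC, ")")


TITLO = "\u0483"  # COMBINING CYRILLIC TITLO
sample = extended_cluster("РКГ", TITLO)
print(parse_clusters(sample))
print(linear_fallback(sample))
```

Because the stand-in end mark has combining class 0, normalization
would not reorder the following marks across it, which is the property
asked for in the original proposal. In the fallback string the titlo
simply follows the closing parenthesis, so a renderer without 2D
layout support would draw it over that glyph.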
Received on Wed Feb 08 2012 - 05:39:58 CST