Re: Devanagari and Subscript and Superscript from Philippe Verdy on 2015-12-16 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 17 Dec 2015 02:50:09 +0100

2015-12-16 19:16 GMT+01:00 Doug Ewell <doug_at_ewellic.org>:

> The ones you suggest are stateful; they affect the rendering of
> arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
> ("ANSI") attribute switching, or ISO 2022 character-set switching.
> Unicode tries hard to avoid encoding such things.

You can try as hard as you want, but there are cases where stateful encoding
cannot be avoided if we want to avoid disunifications, and there are even
some characters that cannot work at all without stateful analysis.

And this is not solved just by style markup when that "style" is in fact
completely semantic. The situation must be taken into account with more
care:

- For example, the superscript Latin letter o, aka "ordinal masculine",
which is not just a superscript but a notation carrying the semantics of an
abbreviation for the final letters, linked to the letters before it, the
whole being semantically a single word: the superscript style does not
create such an attachment, it makes the styled letter a separate "word", so
the character was disunified from the letter o.

- But it is not a good practice to encode in Unicode things that are just
styles without clear semantics (so encoding SUB/SUP is really a bad idea).

- Conversely, it is simply impossible to work with Egyptian hieroglyphs, as
the default clusters are clearly insufficient to create ANY kind of plain
text: you need extra markup to add the necessary semantics, not style, and
this markup should be encodable as plain text, without external markup for
the presentation, when this presentation is fully semantic and clear (e.g.
the Egyptian "cartouche" for names of kings).

- Similar issues occur with SignWriting and other scripts that ALWAYS
require a complex (non-linear) layout, where basic clusters are clearly
insufficient in ALL texts, meaning that the characters that were encoded
are almost **useless** in all plain-text documents: you need extra "format"
characters to create some form of orthographic rule, independently of the
style or of an external markup language.
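
The disunification described in the first item above can be observed in the
Unicode Character Database. The following is a minimal sketch in Python,
using only the standard unicodedata module; the variable names are
illustrative and not part of the original message. It shows that U+00BA
keeps its own identity under canonical normalization (NFC) and only folds to
a plain "o" under compatibility normalization (NFKC), which discards the
distinction.

```python
import unicodedata

# U+00BA is the character the message calls "ordinal masculine".
ordinal = "\u00ba"

# Character name and compatibility decomposition as recorded in the UCD.
name = unicodedata.name(ordinal)
decomp = unicodedata.decomposition(ordinal)

# Canonical normalization preserves the distinct character;
# compatibility normalization folds it into a plain letter o.
canonical = unicodedata.normalize("NFC", "n" + ordinal)
folded = unicodedata.normalize("NFKC", "n" + ordinal)

print(name)
print(decomp)
print(repr(canonical), repr(folded))
```

The decomposition is tagged <super>, which records that the encoded
character is related to a superscript o without being equivalent to a
styled letter.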

I'm in favor of adding **semantic** format characters to Unicode, not
stylistic-only format characters, as soon as there exists a well-known
orthographic convention which would work independently of styling. But for
now the encoded format characters only work on clusters that are too small;
clusters are only linear, and this is clearly not enough (even for
instructing other kinds of text analysis, such as breakers).

Then the renderers will be adapted and extended to work with more complex
clusters that have an internal structure (made of simpler cluster parts).
Other renderers using the legacy rules will not be able to do that, but will
attempt to render some basic fallback (possibly with special visible glyphs
for those controls).

One kind of semantic format character which is useful and encoded is the
"invisible parentheses" for mathematics, which can be encoded for example
after a radical sign: use them around a number to define the extent of
the radical over more than one digit (making a clear visual and semantic
distinction between "sqrt(24)" and "sqrt(2)4" when you don't want to render
any parentheses, or between "sqrt(2+sqrt(3))" and "sqrt(2)+sqrt(3)").
Received on Wed Dec 16 2015 - 19:51:38 CST
