Re: A sign/abbreviation for "magister" from Philippe Verdy via Unicode on 2018-11-03 (Unicode Mail List Archive)

From: Philippe Verdy via Unicode <unicode_at_unicode.org>
Date: Sat, 3 Nov 2018 21:45:40 +0100

As an additional remark, I find that Unicode is slowly abandoning its
initial goals of encoding texts logically and semantically. This
contrasted with the initial ISO 10646, which wanted to produce a giant
visual encoding, based only on code charts (without any character
properties except glyph names and an almost mandatory "representative
glyph", which in fact allowed no variation at all).

The initial ISO 10646 goal failed to reach global adoption. What proved
to be extremely successful (and allowed easier processing of text, without
limiting the variation of glyph designs needed and wanted for the
orthography of human languages) was the Unicode character encoding model,
based on logical semantic encoding. This drove the worldwide adoption (and
now the fast abandonment of legacy charsets, all based on visual appearance
and basic code charts, as in ISO 10646 and all past 7-bit and 8-bit ISO
standards, or other national standards, including in China, Japan, Europe,
or made and promoted by private hardware manufacturers or software
providers, frequently with legal restrictions as well, such as MacRoman
with its well-known Apple logo).

It is disheartening to see that Unicode does not resist this, and even
now refuses the idea of adding just a few simple combining characters (that
fit perfectly in its character encoding model, and still allow efficient
text processing, and rendering with reasonable fallbacks) that would
explicitly encode the semantics (a good example in Latin: look at why the
upper case eth letter seems to have three codes: this is because they have
different semantics but also map to different lowercase letters, and being
able to transform letter cases, and being able to use collation for
plain-text search is an extremely useful feature possible only because of
Unicode character properties, but impossible to do with just the visual
encoding and charts of ISO 10646; the same is true about Latin A versus
Cyrillic A and Greek ALPHA: the semantics is the first goal to respect,
thanks to Unicode character properties and the Unicode character model, but
the visual encoding is definitely not a goal).
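
The property-based distinctions described above can be checked against the
Unicode Character Database. The following is a minimal sketch in Python,
using only the standard unicodedata module; the variable names are
illustrative and not taken from any specification. It compares two groups of
capitals that most fonts draw almost identically: A (Latin), А (Cyrillic)
and Α (Greek), and Ð (eth), Đ (d with stroke) and Ɖ (African d). For the eth
group it is the capitals that look alike, while the lowercase forms differ.

```python
import unicodedata

# Capitals that look alike but belong to three different scripts.
lookalike_a = ["\u0041", "\u0410", "\u0391"]
# Capitals that look alike within Latin: eth, d with stroke, African d.
lookalike_d = ["\u00D0", "\u0110", "\u0189"]

# Case mapping is driven by character properties, not by glyph shape.
lower_a = [ch.lower() for ch in lookalike_a]
lower_d = [ch.lower() for ch in lookalike_d]

for upper, lower in zip(lookalike_a + lookalike_d, lower_a + lower_d):
    print(
        f"U+{ord(upper):04X} {unicodedata.name(upper)}"
        f" -> U+{ord(lower):04X} {unicodedata.name(lower)}"
    )
```

A purely visual encoding would need only one code per group, and could then
neither lowercase these letters correctly nor collate them by language.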

So before encoding characters in Unicode, glyph variation alone is not
enough (it occurs everywhere in human languages): you need proof with
contrasting pairs, showing that the glyph difference makes a semantic
difference and requires different processing (different character
properties).

Unicode has succeeded everywhere ISO 10646 has failed: efficient processing
of human languages with their wide variation of orthographies and visual
appearance. The other goals (supporting technical notations, like IPA,
maths, music, and now emojis!), driven by glyph requirements everywhere
(mandated in their own relevant standards), are where Unicode can and even
should promote the use of variation sequences, and definitely not dual
encoding as was done (Unicode abandoning its most useful goal, not
resisting the pressure of some industries: this has just created more
issues, with more difficulties in correctly and efficiently processing
texts written in human languages).
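
As an illustration of the processing cost of dual encoding, here is a small
sketch in Python (standard unicodedata module only). It assumes that the
Mathematical Alphanumeric Symbols are a fair example of the dual encoding
meant here; the sample string and variable names are made up for the
demonstration. A plain-text search for "f" misses the styled letters until
the text is folded with compatibility normalization (NFKC).

```python
import unicodedata

BOLD_F = "\U0001D41F"    # MATHEMATICAL BOLD SMALL F
ITALIC_F = "\U0001D453"  # MATHEMATICAL ITALIC SMALL F

# "f(f)=f" written with three differently encoded letters f.
text = f"{ITALIC_F}({BOLD_F})=f"

# A naive search only finds the plain Latin letter.
naive_hits = text.count("f")

# NFKC applies the <font> compatibility decompositions first.
folded = unicodedata.normalize("NFKC", text)
folded_hits = folded.count("f")

print(naive_hits, folded_hits, folded)
```

The folding step recovers the letters for searching, but it also discards
the distinction between the three symbols, which is the trade-off at issue.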

The more Unicode evolves, the more I see that it will turn the UCS into
what ISO 10646 attempted to be (and failed): a visual
encoding, refusing to encode **efficiently** any semantic differences. And
this will become a severe problem later with the constant evolution of
human languages.

I urge Unicode to maintain its "character encoding model" as the path to
follow, driven by semantic goals. It has every
feature needed for that: combining sequences (including CGJ because of
canonical equivalences that were needed due to roundtrip compatibility with
legacy non-UCS charsets), variation selectors (ONLY to optionally add some
*semantic* restrictions in the largely allowed variation of glyphs and
still preserve the distinction between contrasting pairs, but NOT as a way
to encode non-semantic styles), and character properties to allow efficient
processing.
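
The canonical-equivalence argument from the earlier message quoted further
down can be tested with existing characters. This is a hedged sketch in
Python (standard unicodedata module only). Since no <combining abbreviation
mark> exists, CGJ (U+034F) is used as a stand-in, on the assumption that the
proposed mark would likewise have combining class 0 and would follow the
diacritics; the helper name canonically_equivalent is hypothetical.

```python
import unicodedata

VS1 = "\uFE00"    # VARIATION SELECTOR-1, combining class 0
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, combining class 230
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, stand-in for the proposed mark


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The variation selector must directly follow the base it modifies.
vs_decomposed = "e" + VS1 + ACUTE   # <e, VS1, acute>
vs_precomposed = "\u00E9" + VS1     # <e-acute, VS1>

# A trailing class-0 mark comes after the diacritic in both spellings.
mark_decomposed = "e" + ACUTE + CGJ  # <e, acute, mark>
mark_precomposed = "\u00E9" + CGJ    # <e-acute, mark>

vs_equal = canonically_equivalent(vs_decomposed, vs_precomposed)
mark_equal = canonically_equivalent(mark_decomposed, mark_precomposed)
print(vs_equal, mark_equal)
```

The variation selector breaks equivalence because a class-0 character
between the base and the accent blocks reordering, while a class-0 mark
placed after the whole combining sequence leaves it intact.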

On Sat, 3 Nov 2018 at 21:02, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> Likewise, the separate encoding of mathematical variants could have been
> completely avoided (we know that this encoding is not sufficient, so much
> so that even LaTeX renderers simply don't need it or use it!).
>
> We could have just encoded a single <combining mathematical symbol> to use
> after any base cluster, and the whole set would have been covered!
>
> The additional distinction of visual variants (monospace, bold, italic...)
> would have been encoded using variation selectors after the <combining
> mathematical symbol>: the semantics as a mathematical symbol would still
> be preserved, including the additional semantics for distinguishing some
> symbols in maths notations like "f(f)=f" where the 3 "f" must be
> distinguished (between the function in a set of functions, the source
> belonging to one set of values or being a variable, and the result in
> another set which may be a value or variable).
>
> Once again this would have covered all the needs without using this duplicate
> encoding (that was NEVER needed for roundtrip compatibility with legacy
> non-UCS charsets).
>
> All I ask is reasonable: it's just a SINGLE code point to encode the
> combining mark itself, semantically, NOT visually.
>
> The visual appearance can be controlled by an additional variation
> selector to cancel the effect of glyph variations allowed for ALL
> characters in the UCS, where there's just a **non-mandatory** form
> generally used by default in fonts and matching more or less the
> "representative glyph" shown in the Unicode and ISO 10646 charts, which
> cannot show all allowed variations (if there's a need to detail them,
> Unicode offers the possibility to ask to register known "variation
> sequences" which can feed a supplementary chart showing more representative
> glyphs, one for each accepted "variation sequence", but without even
> needing to modify the "representative glyph" shown in the base chart).
>
> Note that even if Unicode requires registration of variation sequences
> prior to using them, the published charts still omit the additional
> charts (just below the existing base chart) showing representative glyphs
> for accepted sequences, with one small chart per base character, listing
> them simply ordered by "VSn" value. All that Unicode publishes is a
> mere data list with some names (not enough for most users to be aware that
> variations can be encoded explicitly and compliantly).
>
>
> On Sat, 3 Nov 2018 at 20:41, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
>>
>>
>> On Fri, 2 Nov 2018 at 20:01, Marcel Schneider via Unicode <
>> unicode_at_unicode.org> wrote:
>>
>>> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote:
>>> [quoted mail]
>>> >
>>> > Using variation selectors is only appropriate for these existing
>>> > (preencoded) superscript letters ª and º so that they display the
>>> > appropriate (underlined or not underlined) glyph.
>>>
>>> And it is for forcing the display of DIGIT ZERO with a short stroke:
>>> 0030 FE00; short diagonal stroke form; # DIGIT ZERO
>>> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt
>>>
>>> From that it becomes unclear why that isn’t applied to 4, 7, z and Z
>>> mentioned in this thread, to be displayed open or with a short bar.
>>>
>>> > It is not a solution for creating superscripts on any letters and
>>> > marking that they should be rendered as superscript (notably, the base
>>> > letter to transform into superscript may also have its own combining
>>> > diacritics, that must be encoded explicitly, and if you use the
>>> > variation selector, it should allow variation on the presence or
>>> > absence of the underline (which must then be encoded explicitly as a
>>> > combining character)).
>>>
>>> I totally agree that abbreviation indicating superscript should not be
>>> encoded using variation selectors, as already stated I don’t prefer it.
>>> >
>>> > So finally what we get with variation selectors is: <baseline letter,
>>> > variation selector, combining diacritic> and <baseline letter
>>> > precombined with the diacritic, variation selector> which are NOT
>>> > canonically equivalent.
>>>
>>> That seems to me like a flaw in canonical equivalence. Variations must
>>> be canonically equivalent, and the variation selector position should
>>> be handled or parsed accordingly. Personally I’m unaware of this rule.
>>> >
>>> > Using a combining character avoids this caveat: <baseline letter,
>>> > combining diacritic, combining abbreviation mark> and <baseline letter
>>> > precombined with the diacritic, combining abbreviation mark> which
>>> > ARE canonically equivalent. And this explicitly states the semantics
>>> > (something that is lost if we are forced to use presentational
>>> > superscripts in a higher level protocol like HTML/CSS for rich text
>>> > format, and one just extracts the plain text; using collation will
>>> > not help at all, except if collators are built with preprocessing
>>> > that will first infer the presence of a <combining abbreviation mark>
>>> > to insert after each combining sequence of the plain text enclosed in
>>> > an italic style).
>>>
>>> That exactly outlines my concern with calls for relegating superscript
>>> as an abbreviation indicator to higher level protocols like HTML/CSS.
>>>
>>
>> That's exactly my concern: this relation to HTML/CSS should NOT
>> occur at all! It's really not the solution; HTML/CSS styles have NO
>> semantics at all (I demonstrated it in the message you are quoting).
>>
>>
>>> > There's little risk: if the <combining abbreviation mark> is not
>>> > mapped in fonts (or not recognized by text renderers to create
>>> > synthetic superscripts from existing recognized clusters), it
>>> > will render as a visible .notdef (tofu). But normally text renderers
>>> > recognize the basic properties of characters in the UCD and can see
>>> > that <combining abbreviation mark> has a combining mark general
>>> > category (they also know that it has a combining class of 0, so
>>> > canonical equivalences are not broken) and so can render a better
>>> > symbol than the .notdef "tofu": they should rather render a dotted circle.
>>> > Even if this tofu or dotted circle is rendered, it still explicitly
>>> > marks the presence of the abbreviation mark, so there's less
>>> > confusion about what is preceding it (the combining sequence that was
>>> > supposed to be superscripted).
>>>
>>> The problem with the <combining abbreviation mark> you are proposing
>>> is that it contradicts streamlined implementation as well as easy
>>> input of current abbreviations like ordinal indicators in French and,
>>> optionally, in English. Preformatted superscripts are already widely
>>> implemented, and coding of "4ᵉ" only needs two characters, input
>>> using only three fingers in two times (thumb on AltGr, press key
>>> E04 then E12) with an appropriately programmed layout driver. I’m
>>> afraid that the solution with <combining abbreviation mark> would be
>>> much less straightforward.
>>>
>>
>> This is not a real concern: this is a legacy practice that should no
>> longer be recommended as it is ambiguous (nothing says that "4ᵉ" is an
>> abbreviated ordinal; it can as well be 4 raised to the power e, or
>> various other things).
>>
>> Also, the keys to press on a keyboard are absolutely not a concern: the
>> same key presses you propose can as well generate the letter followed by
>> the combining abbreviation mark. In fact what you propose is even less
>> practical because it uses complex input for all characters and requires
>> mapping keys on the whole alphabet (so it uses precious space on the key
>> layout). It's just simpler for everyone to press "4", "e", followed by a
>> combination (like AltGr+".") to produce the <combining abbreviation mark>!
>>
>> And these legacy superscript characters are still not guaranteed to lack
>> an underline (the variation may as well be significant), and there
>> will never be enough superscript characters for the many superscript
>> notations (not just abbreviations) that should still be encoded with the
>> normal letters (including in clusters, with diacritics, ligatures and so
>> on): Unicode will never accept to reencode all existing letters (plus the
>> infinite set of clusters that can be formed with them) just to turn them
>> into superscript/subscript variants. These encodings, which found their
>> way in from the need for roundtrip compatibility with legacy charsets
>> (before the UCS), should never have occurred at all: they should not even
>> have been tolerated for IPA symbols or for mathematical symbols
>> (monospace, bold, italic...).
>>
>> The variation selector solution is also not suitable when the intent is
>> only to add semantics to the encoded text and not to drive the choice
>> between glyph variants (when the default glyph without the variation
>> selector can FREELY vary into forms that are UNACCEPTABLE in some
>> contexts, then the variation does not really encode the semantics but
>> encodes the visual rendering intent: it is too easily abused to do
>> something else).
>> But a single *semantic* combining mark does not encode any visual
>> rendering intent the way variation selectors do. It still allows glyphic
>> variations as long as the semantics are kept, and it has the correct
>> fallbacks (there's no obscuring of the encoding of the clusters to which
>> the semantic combining mark applies: they are still part of the same
>> general encoding as normal letters, and rendering the abbreviation mark
>> does not necessarily mean that the base cluster MUST be rendered
>> differently from normal letters: it is permitted as well to render the
>> combining mark for example as a dot, or as a true diacritic on top of the
>> letters). And if needed the following can control the visual appearance:
>>
>>> >
>>> > The <combining abbreviation mark> can also have its own <variation
>>> > selector> to select other styles when they are optional, such as
>>> > adding underlines to the superscripted letter, or rendering the
>>> > letter instead as underscript, or as a small baseline letter with a
>>> > dot after it: this is still an explicit abbreviation mark, and the
>>> > meaning of the plain text is still preserved: the variation selector
>>> > is only suitable to alter the rendering of a cluster when it has
>>> > effectively several variants and the default rendering is not
>>> > universal, notably across font styles initially designed for specific
>>> > markets with their own local preferences: the variation selector
>>> > still allows the same fonts to map all known variants distinctly,
>>> > independently of the initial arbitrary choice of the default glyph
>>> > used when the variation selector is missing.
>>>
>>>
Received on Sat Nov 03 2018 - 15:46:17 CDT
