Re: A sign/abbreviation for "magister" from Marcel Schneider via Unicode on 2018-10-30 (Unicode Mail List Archive)

From: Marcel Schneider via Unicode <unicode_at_unicode.org>
Date: Tue, 30 Oct 2018 16:52:47 +0100 (CET)

Rather than a dozen individual e-mails, I’m sending this omnibus reply
for the record. Even though everything has already been discussed and
fixed here and in CLDR (SurveyTool forum and Trac), the points raised
still need to be acknowledged and followed up ahead of the upcoming
surveys, the next of which is to start in 30 days.

First here: On 29/10/2018 at 12:43, Dr Freytag via Unicode wrote:

[…]
> The use of superscript is tricky, because it can be optional in some
> contexts; if I write "3rd" in English, it will definitely be
> understood no different from "3rd".

[Note that this second instance was actually intended to read "3ʳᵈ",
but it was formatted using a higher-level protocol.]

[…]
> In TeX the two transition fluidly. If I was going to transcribe such
> texts in TeX, I would construct a macro […]
[…]
> Nevertheless, I think the use of devices like combining underlines
> and superscript letters in plain text are best avoided.

While most other scripts from Arabic to Duployan are generously granted
all and everything they need for accurate representation, starting with
preformatted superscripts and ending with superscripting or subscripting
format controls, Latin script is often quite deliberately pulled down
in order to make it unusable outside high-end DTP software, from
TeX to Adobe InDesign, with the notable exception of sparsely and
parsimoniously encoded preformatted characters for phoneticists and
medievalists. E.g. in Arabic script, superscript is considered worth
encoding and using without any caveat, whereas when it comes to Latin
script, superscripts are thrown into the same cauldron as underscoring.

Obviously Unicode don’t apply to Latin script the same principle they
do to all other scripts, i.e. to free preformatted letters as suitable
if they are part of a standard representation and in some cases are
needed to ensure unambiguity. Mediterranean locales had preformatted
ordinal indicators even in the Latin-1-only era, although "1a" and "2o"
may be understood no differently from "1ª" and "2º". The degree sign, that
is on French keyboards, is systematically hijacked to represent the
"n°" abbreviation, unless a string is limited to ASCII-only. Several
Latin-script-using locales have standard representations and strong
user demands for superscripts, which instead of being satisfied on
Unicode level as would be done for any other of the world’s scripts,
are obstinately rebuffed when not intended for phonetics, or in
some cases, for palaeography.
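
For readers who want to check the Latin-1 claim, here is a minimal
Python sketch. It uses only the standard library; the helper name
`describe` and the labels are mine, chosen for illustration. It shows
that the two ordinal indicators and the degree sign each fit in a single
Latin-1 byte, and that compatibility normalization (NFKC) folds the
ordinal indicators to plain "a" and "o" while leaving the degree sign
alone.

```python
import unicodedata

# Characters named in the paragraph above (labels are illustrative only).
CHARS = {
    "feminine ordinal": "\u00AA",
    "masculine ordinal": "\u00BA",
    "degree sign": "\u00B0",
}

def describe(ch):
    """Return (Unicode name, Latin-1 byte, NFKC folding) for one character."""
    return (
        unicodedata.name(ch),
        ch.encode("latin-1"),
        unicodedata.normalize("NFKC", ch),
    )

for label, ch in CHARS.items():
    name, byte, folded = describe(ch)
    print(f"{label}: U+{ord(ch):04X} {name}, "
          f"Latin-1 0x{byte.hex().upper()}, NFKC -> {folded!r}")
```

The NFKC column is the point: software that applies compatibility
folding turns "1ª" into "1a", which is the loss of distinction
discussed above.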

I have not tried to find out which UTC members regularly and
aggressively contradict ballot comments about encoding palaeographic
Latin letters while proving unable to sustain any open and honest
discussion on this List or elsewhere. I am referring to what
Dr Everson via Unicode wrote on 28/10/2018 at 21:49:

> I like palaeographic renderings of text very much indeed, and in fact
> remain in conflict with members of the UTC (who still, alas, do NOT
> communicate directly about such matters, but only in duelling ballot
> comments) about some actually salient representations required for
> medievalist use.

That said: On 29/10/2018 at 09:09, James Kass via Unicode wrote:
[…]
> If I were entering plain text data from an old post card, I'd try
> to keep the data as close to the source as possible. Because that
> would be my purpose. Others might have different purposes.
> As you state, it depends on the intention. But, if there were an
> existing plain text convention I'd be inclined to use it.
> Conventions allow for the possibility of interchange, direct
> encoding would ensure it.

The goal of discouraging Latin superscripts is obviously to ensure
that reliable document interchange is limited to the PDF.

If Unicode were allowed to emit an official recommendation to use
preformatted superscripts in Latin script, too, then font designers
would implement comprehensive support of combining diacritics, and
any plain text including superscripted abbreviations could use the
preformatted characters, in order to achieve the interoperability
that Unicode was designed for. Referring to what Dr Verdy via Unicode
wrote on 28/10/2018 at 19:01:

[…]
> However it is still not very elegant if we stil need to use only
> the limited set of superscript letters (this still reduces the
> number of abbreviations, such as those commonly used in French
> that needs a superscript "é")

The use of combining diacritics with preformatted superscripts is
also the reason why Unicode is limiting encoding support to base
letters, even for preformatted superscript letters. The rule that
no *new* precomposed letters with acute accent are encoded anymore
applies to superscripts too. A Unicode-conformant way to represent
such abbreviations would IMO use U+1D49 followed by U+0301: "ᵉ́".
Other representations may require OpenType support, which in Latin
script is often turned off, supposedly in order to shift to higher
level protocols what Unicode makes available in plain text.
Referring to what Dr Kass wrote on 29/10/2018 at 01:05:

[…]
> "Mr͇" for display purposes may look as daft as "/italics/", but
> it captures the elements of the text of the original manuscript.
> And it would allow preservation of abbreviations such as for
> "constitutionalité" → "Ct͇é͇".
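
The sequence U+1D49 U+0301 suggested above can be inspected with a short
Python sketch (standard library only; the names `SUPERSCRIPT_E_ACUTE`
and `code_points` are mine, for illustration). It lists the two code
points, shows that canonical normalization (NFC) leaves the sequence
untouched because no precomposed character exists for it, and shows that
compatibility normalization (NFKC) collapses it to an ordinary "é".

```python
import unicodedata

# The sequence proposed in the message: superscript e + combining acute.
SUPERSCRIPT_E_ACUTE = "\u1D49\u0301"

def code_points(text):
    """List each code point of text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# NFC keeps both code points: there is no precomposed superscript e-acute.
nfc = unicodedata.normalize("NFC", SUPERSCRIPT_E_ACUTE)

# NFKC replaces the modifier letter by plain "e", then composes "é".
nfkc = unicodedata.normalize("NFKC", SUPERSCRIPT_E_ACUTE)

print(code_points(SUPERSCRIPT_E_ACUTE))
print(len(nfc), code_points(nfkc))
```

Whether the accent is positioned correctly over the superscript letter
is a font matter that this sketch cannot test.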

Using superscripts plus combining diacritics might be a way to
address the limitations Dr Verdy mentioned on 30/10/2018 at 02:56:

[…]
> Obviously the Latin script should not use any kind of visual
> encoding, and even the superscript letters (initially introduced
> for something else, notably as distinct symbols for IPA) was not
> the correct path (it also has limitation because the superscript
> letters are quite limited; […]

But for font designers to implement combining diacritics for use
with preformatted superscripts, Unicode needs to explicitly allow
or recommend the use of preformatted superscripts in abbreviations.

This use case is different from the one that led to the submission of
the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29:

[…]
> The abbreviation in the postcard, rendered in plain text, is "Mr".
> Bringing U+02B3 or U+036C into the discussion just fuels the
> recurring demands for every Latin letter (and eventually those
> in other scripts) to be duplicated in subscript and superscript,
> à la L2/18-206.

IMO this proposal implodes when considering that the preformatted
characters are supposed to be inserted by the application rather
than directly out of keyboard drivers.

The document L2/18-206 seems to originate from the observation
of poor fonts and rendering engines in low-end document editing
software. As previously mentioned, the fix is already available
using high-end DTP software. That is sustainable as long as no
locales are impacted. What this thread is about is a digitally
interoperable representation of actual languages. E.g. small caps
is out of scope, given the postcard writer did not write the names
in small caps, that in Latin script are merely a stylistic
convention intended for scientific publication and so on — while
Cyrillic script currently uses “small caps” to write in lowercase.

Cyrillic also uses the № sign, that is mapped to the second level
on key E03 ("3" key) on the Russian and other Cyrillic keyboards.
Russian keyboard layout:
https://docs.microsoft.com/en-us/globalization/keyboards/kbdru.html
Bulgarian (phonetic traditional) keyboard layout:
https://docs.microsoft.com/en-us/globalization/keyboards/kbdbgph1.html

Perhaps the Numero sign is used in Cyrillic after it had been encoded
for East Asian as Dr Wallace via Unicode hinted on 28/10/2018 at 21:20:

[…]
> AIUI, № was encoded as a compatibility character because it appears
> in some East Asian character sets

Still № is also encoded in ISO/IEC 8859-5, at 0xf0.
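
Both statements can be checked with a few lines of Python (a sketch
using the standard `iso8859_5` codec and `unicodedata`; the variable
names are mine). Byte 0xF0 in ISO/IEC 8859-5 decodes to U+2116 NUMERO
SIGN, and the character carries a compatibility decomposition to "No",
consistent with its status as a compatibility character.

```python
import unicodedata

# Decode byte 0xF0 with Python's ISO/IEC 8859-5 codec.
NUMERO = b"\xf0".decode("iso8859_5")

numero_name = unicodedata.name(NUMERO)
# Compatibility decomposition as recorded in the Unicode Character Database.
numero_decomposition = unicodedata.decomposition(NUMERO)
# NFKC folds the sign to the two letters "N" and "o".
numero_folded = unicodedata.normalize("NFKC", NUMERO)

print(f"U+{ord(NUMERO):04X} {numero_name}")
print(numero_decomposition, repr(numero_folded))
```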

Further, Dr Whistler via Unicode stated on 30/10/2018 at 05:54:

[…]
> The mere fact that some visual aspect of graphic representation on a
> page of paper can be implemented via a mechanical typewriter does not,
> ipso facto, mean that particular feature is plain text. The fact that I
> could also implement superscripting and subscripting on a mechanical
> typewriter via turning the platen up and down half a line, also does not
> make *those* aspects of text styling plain text. either.

The reverse is true, too: The fact that some language representation was
performed by tweaking the typewriter didn’t tag that representation as not
plain text. E.g. the LATIN CAPITAL LETTER C WITH CEDILLA couldn’t be typed
by holding Shift and hitting "ç"—key E09, the "9" key—on a French keyboard.
Nevertheless it is required for legibility when "ç" occurs at the start of
a sentence or in all-caps.
The workaround was to type a COMMA over LATIN CAPITAL LETTER C.

Likewise, SUPERSCRIPT TWO was available on French (France) typewriters,
and Belgian French ones had SUPERSCRIPT THREE, too. Also, again, the now
MODIFIER LETTER SMALL O was and still is emulated using the DEGREE SIGN
(on level 2 of key E11). The fact that other superscript letters needed
turning the platen does not make them belong to rich text, today.
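
As a rough illustration of which of these typewriter characters made it
into the Latin-1 repertoire, here is a small Python sketch (standard
library only; `TYPEWRITER_CHARS` and `in_latin1` are names I made up for
the example). It also shows that case mapping of "ç" yields the capital
that the French keyboard could not type directly.

```python
import unicodedata

# Characters mentioned in the two paragraphs above.
TYPEWRITER_CHARS = ["\u00B2", "\u00B3", "\u00B0", "\u1D52", "\u00C7"]

def in_latin1(ch):
    """Return True if ch can be encoded in ISO/IEC 8859-1 (Latin-1)."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

for ch in TYPEWRITER_CHARS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: Latin-1 = {in_latin1(ch)}")

# Uppercasing "ç" gives LATIN CAPITAL LETTER C WITH CEDILLA.
capital_c_cedilla = "\u00E7".upper()
```

Only MODIFIER LETTER SMALL O falls outside Latin-1 in this list, which
is consistent with the degree sign still being used to emulate it.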

It’s as Dr Kass via Unicode put it on 30/10/2018 at 10:09 when replying
to Dr Whistler via Unicode (above):

[…]
> If the typist didn't intend to put a superscript "r" on that page with a
> double underline, the typist wouldn't have bothered with all that jive.
>
> It's about the importance one places on respecting authorial intent.
>
[…]
> […] Underscoring might be stripped without messing with the legibility,
> but so could tatweels and lots of other stuff. […]

If the intent of Unicode is to discriminate Arabic script vs Latin script,
that would be worth mentioning in the Standard.

Making claims about interoperability and about unambiguous representation
of all of the world’s scripts, Unicode is expected to do so for Latin, too.

Dr Bień via Unicode wrote on 29/10/2018 at 06:40:

> > […] It's a matter of opinion, and opinions often differ.
>
> Well said, but I make the claim stronger; it depends on the purpose of
> the encoding and intended applications.

Dr Everson via Unicode replied to Dr Karocki on 28/10/2018 at 22:55:
>
> I think that it is the _superscription_ that indicates the fact that
> it is an abbreviation.

Hence Unicode is expected to fully support the use of plain text
superscript for those locales using superscript as an abbreviation
indicator, in the same role as other locales may use colon or period,
a usage that Dr Dürst via Unicode mentioned on 29/10/2018 at 08:04
responding to Dr Everson’s 05:42 (same day) e-mail:

[…]
> I think this may depend on actual writing practice. In German at least,
> it is customary to have dots (periods) at the end of abbreviations, and
> using any other symbol, or not using the dot, would be considered an error.

In some locales, French among them, not using superscript should likewise
be considered an error.
It’s just that the perception of a superscript-less abbreviation that
normally uses superscript, is biased by the computer keyboard layouts
actually still in use (but hopefully soon to be enhanced by more complete
layouts).

Now is Unicode inspired by typewriting practice when designing the encoding
of Latin script, unlike what is done for potentially all other scripts?

Dr Bradfield just added on 30/10/2018 at 14:21 something that I didn’t
know when replying to Dr Ewell on 29/10/2018 at 21:27:

[…]
> The English abbreviation Mr was also frequently superscripted in the
> 15th-17th centuries, and that didn't mean anything special either - it
> was just part of a general convention of superscripting the final
> segment of abbreviations, probably inherited from manuscript practice.

So English dropped the superscript requirement for common abbreviations
in the 17ᵗʰ or 18ᵗʰ century to keep it only for ordinals. Should Unicode
now follow the example of English and degrade the representation of French?
Fortunately it does not, as the French ordinal indicators are now a part
of CLDR, consistently with what the French national body intended when
setting up again a design process of a locale-conformant keyboard.

The rest of the superscript abbreviation letters should follow in CLDR
once browsers use correct fonts for displaying the data.

We remember that The Unicode Standard explicitly specifies that the
glyphs of all superscript or modifier letters of a script shall be equalized.
No ransom note effect is allowed in Unicode-conformant fonts (except for
the purpose of artwork, as in Apple’s former San Francisco typeface).

Best regards,

Marcel
Received on Tue Oct 30 2018 - 10:53:18 CDT
