French Superscript Abbreviations Fit Plain Text Requirements (was: Re: a character for an unknown character) from Marcel Schneider on 2016-12-28 (Unicode Mail List Archive)

From: Marcel Schneider <charupdate_at_orange.fr>
Date: Wed, 28 Dec 2016 16:25:48 +0100 (CET)

Iʼm pleasantly surprised that this thread has reached a point where I can
spin off a topic that I first drafted 18 days ago.

On Tue, 27 Dec 2016 21:33:32 -0800, Asmus Freytag wrote:
>
[quoted mails]
>
> (Most) character properties can be adjusted, so the statement above would
> need to be drawn much more narrowly.
>
> The generic issue that Unicode runs into is that there are things like
> "letters" that have well-defined identities (the letter A), but, perhaps
> because of that, have a very wide ranging set of real images - some of the
> fanciful ones may bear scant relation to the archetypal shape. However,
> because they are members of bounded, and extremely well-known, sets
> (alphabets) users are tolerant of artistic license. In addition, they are
> generally used in longer contexts (words) where their identity is
> reaffirmed, independent of their shape, by occurring in the expected
> juxtapositions (and mostly not occurring in other, unexpected ones).
>
> However, the conventions where and when to use one of these letters are not
> fixed, not even their phonetic equivalents.

As far as I understand the issue so far, Unicode does not fix usage
conventions; it merely gives readers of the Code Charts and TUS some hints
about the original encoding rationale, and sometimes about contexts added later
in which the character is found. For instance, U+202F NARROW NO-BREAK SPACE was
encoded for Mongolian, but is also used in French, where it is preferred with
certain punctuation marks. TUS 9.0, §6.2, Space Characters, Narrow No-Break Space,
p. 269, says it “can be used to represent the narrow space occurring around
punctuation characters in French typography, which is called an ‘espace fine
insécable.’”
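
As an illustration of such a usage convention, here is a minimal Python sketch
that puts U+202F before a few French punctuation marks. The set of marks and
the function name are assumptions made for the example; actual typographic
practice varies (some style guides give the colon a wider no-break space, so
it is left out here).

```python
import re

NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE

# Assumption for this sketch: only ';', '!' and '?' take the narrow space.
# One or more ordinary spaces or no-break spaces directly before such a mark
# are matched; the mark itself is only looked at, not consumed.
SPACE_BEFORE_MARK = re.compile("[ \u00a0]+(?=[;!?])")


def french_spacing(text):
    """Replace the space before ';', '!' or '?' with a single U+202F."""
    return SPACE_BEFORE_MARK.sub(NNBSP, text)


sample = "Vraiment ! Oui ; non"
print([f"U+{ord(c):04X}" for c in french_spacing(sample) if not c.isascii()])
# ['U+202F', 'U+202F']
```

Text that has no space before the mark is left untouched, so the sketch only
upgrades spaces that the writer already typed.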

>
> Contrast that with many marks. The really common ones, like the period, are
> well-known enough that fonts can substitute small squares or other shapes
> without impeding their use in normal text. However, outside standard
> sentence punctuation, they can be re-used for many other purposes. Some
> such uses, like the Swedish use of ":" in the middle of an abbreviation,
> may be unusual enough to not readily be catered to by all text-processing
> software (e.g. in word-segmentation).

From this I conclude that a given character can be used under any convention,
regardless of how much software is not yet up to date enough to handle it
correctly in every circumstance, including but not limited to equivalence classes.

This is relevant to the representation of abbreviations in French, which does
not use a colon or an (in-word) period when abbreviating, for example, the
word for 'numbers', or ordinals like '2nd', '3rd', '4th'. Before the recent
threads, this was discussed a decade ago, so Iʼll pick out some highlights below.

>
> Nevertheless, the same thing applies as with letters: where and when to use
> one of these marks is not fixed as part of their encoding, not even their
> functions.

So definitely the use of superscript Latin letters can hardly be limited to
IPA, though most of them were initially intended (i.e. encoded) for phonetic
transcription. But cross-checking the relevant parts of the Standard [1][2]
leads to the conclusion that their use must be necessary for an unambiguous
representation in plain text, following the Unicode definition of plain text:
“/Plain text must contain enough information to permit the text to be rendered
legibly, and nothing more./ The Unicode Standard encodes plain text.”
(TUS 9.0, p. 19.)

Applied to the French abbreviation of “numéros” (numbers), this means that the
abbreviationʼs final letters 'os' *must not* be superscripted by formatting
alone: since “the extra information in rich text can always be stripped away to
reveal the ‘pure’ text underneath” (TUS, ibid.), 'n^{os}' would end up as 'nos'
(“our”, with a plural noun). Consequently, best practice is to represent it
using the Unicode superscript “modifier letters”: 'nᵒˢ'.
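
The difference can be sketched in Python. This is only an illustration under
stated assumptions: the '^{...}' notation stands in for whatever rich-text
superscript markup is in use, the SUPERSCRIPT table and both function names are
hypothetical, and the table covers just a few letters.

```python
import re
import unicodedata

# Hypothetical lookup table for this sketch: base letter -> modifier letter.
SUPERSCRIPT = {
    "e": "\N{MODIFIER LETTER SMALL E}",  # U+1D49
    "m": "\N{MODIFIER LETTER SMALL M}",  # U+1D50
    "o": "\N{MODIFIER LETTER SMALL O}",  # U+1D52
    "r": "\N{MODIFIER LETTER SMALL R}",  # U+02B3
    "s": "\N{MODIFIER LETTER SMALL S}",  # U+02E2
}

# Assumed stand-in for rich-text superscript markup: ^{...}
MARKUP = re.compile(r"\^\{([^}]*)\}")


def strip_formatting(text):
    """Remove the markup layer and keep only the underlying letters."""
    return MARKUP.sub(lambda m: m.group(1), text)


def encode_superscripts(text):
    """Replace marked-up letters with modifier letters (unmapped ones stay as-is)."""
    return MARKUP.sub(
        lambda m: "".join(SUPERSCRIPT.get(c, c) for c in m.group(1)), text
    )


rich = "n^{os}"
plain = encode_superscripts(rich)

print(strip_formatting(rich))   # nos
print(plain)                    # nᵒˢ
print(strip_formatting(plain))  # nᵒˢ (nothing left to strip)
print(unicodedata.normalize("NFKC", plain))  # nos
```

The last line is a caveat: compatibility normalization (NFKC) folds the
modifier letters back to base letters, so the distinction survives only in
processes that do not apply it.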

>
> Many other "simple" marks: lines, circles, triangles, hooks, and squares,
> or groups of them, are likewise subject to frequent reuse. Some of them may
> have been incorrectly encoded more than once. Like the standard punctuation
> marks, both their precise shapes and precise functions are subject to
> stylistic or other conventions.

From this, it seems doubtful that encoding the superscript small letter e
a second time would be accepted, since the only possible rationale is mere
fine-tuning of the vertical alignment (modifier letters being typically less
raised than formatted superscripts).

>
> When it comes to marks (or symbols) of less generic or more complex shapes,
> the presumption that the mark only has "one" shape may be more common, and
> examples of the mark being repurposed may be less common. Not being as
> common, fewer readers will recognize all stylistic variations as being "the
> same thing". A variant form will be more likely to be understood as a
> related, but not identical symbol. That in turn fuels the misperception
> that Unicode somehow encodes symbols based on a single conventional usage.

Iʼm inclined to believe that this settles all objections that dismiss the use
of modifier letters as superscripts, wherever appropriate, as “non-standard”,
“a hack” and the like. Such a narrow reading of the Unicode documentation is
thus due to a misperception fueled by current user experience, in addition to
the TUS disclaimer that these characters are not intended to replace generic
formatting. The very reason why this guideline is applied to French
abbreviations seems to be that relegating their correct representation to the
realm of higher-level protocols was the way (why not call it the “hack”), along
with other available means (mainly the degree sign), to represent them
unambiguously even before Unicode provided the superscript letters.

The unambiguous and coherent representation of abbreviations with superscript
letters from plain text on upwards may end up giving the French language the
status of an exception, granted that in English, superscripting in this
context is a mere styling issue. But it is not much of an exception, since
ordinal indicators were encoded for a small set of languages in earlier
standards. Adding one more exception will have very few consequences on
Unicodeʼs side. Presumably they wonʼt exceed the encoding of MODIFIER LETTER
SMALL Q, which has been and still is the subject, or a part, of past and
possible future proposals, most of them unrelated to French. Please note again
that abbreviations like '2^{ème}' are officially deprecated in favor of '2^{e}'
style forms, so there is very little need for diacritics in French
abbreviations. The few examples that do need one include 'S^{té}' (“Société”,
Corporation). So updating fonts to support combining diacritics here would be
handy.
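
For the few abbreviations that do carry a diacritic, one conceivable plain-text
encoding is a modifier letter followed by a combining mark. The Python sketch
below only illustrates that idea under stated assumptions; the SUPERSCRIPT
table and the function name are hypothetical, and whether the result displays
well depends entirely on font support.

```python
import unicodedata

# Hypothetical lookup table for this sketch: base letter -> modifier letter.
SUPERSCRIPT = {
    "e": "\N{MODIFIER LETTER SMALL E}",  # U+1D49
    "t": "\N{MODIFIER LETTER SMALL T}",  # U+1D57
}


def superscript_with_marks(letters):
    """Map base letters to modifier letters and keep combining marks.

    NFD first splits 'é' into 'e' + U+0301 COMBINING ACUTE ACCENT, so the
    accent ends up after the modifier letter.
    """
    decomposed = unicodedata.normalize("NFD", letters)
    return "".join(SUPERSCRIPT.get(c, c) for c in decomposed)


ste = "S" + superscript_with_marks("té")
print([f"U+{ord(c):04X}" for c in ste])
# ['U+0053', 'U+1D57', 'U+1D49', 'U+0301']
```

There is no precomposed “modifier letter small e with acute”, so the sequence
stays as two code points under NFC as well, which is why font support for
combining marks on modifier letters matters here.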

Above all, adding a corresponding statement to the Core Specification, as was
done for the ‘espace fine insécable’, would be nice and would put everybody at ease.

Additionally Iʼd like to cite the related _2006 thread_, quoting some snippets
that seem particularly interesting to me, some — but not all — of which are
referring to French abbreviations (a topic that already spun off another one):

http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0270.html
On Fri Mar 24 2006 - 17:15:22 CST, Kent Karlsson wrote:
> Antoine Leca wrote:
[…]
> > […] but it probably has to be done outside of the
> > codepoints
> That would be too frail, and not reliable.

http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0290.html
On Mon Mar 27 2006 - 09:03:48 CST, Antoine Leca wrote:
> Kent Karlsson wrote:
> > Antoine Leca wrote:
[…]
> > > Are you intending to say that if I wrote "Mme" (Mrs in
> > > French), I should differentiate, in a not yet standardised
> > > way, the fact that I write it with superscript characters
> > > or not? Saying it is a "spelling" difference?
> > Definitely. In this particular case one may debate whether to use
> > markup or to (ab)use U+1D50 MODIFIER LETTER SMALL M and
> > U+1D49 MODIFIER LETTER SMALL E.
> Put it in clear: to write the French equivalent of Mrs, I can:
> - either write the slightly incorrect Mme
> - or write the more "correct" M[][] (where [] represent the empty box that
> everybody except four cats will effectively see).
> Somewhere I am thinking this is *not* a working solution.

http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0291.html
On Mon Mar 27 2006 - 09:36:31 CST, Doug Ewell replied to Antoine Leca:
[…]
> This is what I consider the Great Unicode Conundrum. We want to use the
> rich character repertoire that Unicode provides, but we also want to
> avoid displaying mojibake on the user's screen, causing him to mumble
> and curse about "that stupid Unicode" or giving him security concerns.
> The problem persists because popular fonts are not always updated
> quickly and inexpensively to support new and rare Unicode characters.
> So we avoid using rare and -- more importantly -- newly added
> characters, preferring ASCII fallbacks of the sort Unicode was intended
> to replace.

http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0293.html
On Tue Mar 28 2006 - 02:19:49 CST, Antoine Leca replied to Doug Ewell:
[…]
> While I agree with your pertinent remark on a general way, in THIS case I
> believe this is not adequate. Those two characters (U+1D50 and U+1D49, ᵐᵉ)
> do not seem to me to be intended for French abbreviations (or any written
> language typographics effects), but rather for phonetics. As a result, it
> seems difficult to me to ask French people to have phonetics-specialized
> fonts, in order to read something as common as the abbreviation for Mrs,
> just because it caught the attention of someone that those characters almost
> fit that particular needs.
> I can be wrong though.

This may have been true at the time, but today most current fonts support them.
The mail continues; interested readers are welcome to read more in the archive.
And a last one:

http://www.unicode.org/mail-arch/unicode-ml/y2006-m03/0294.html
On Tue Mar 28 2006 - 04:12:10 CST, Keutgen, Walter replied:
[…]
> In real every day usage, in French 'mechanical typewriting', 'PC typewriting' and
> *hand writing* one did/does not superscript the endings of the abbreviations
> 'Mme, Mmes, Mlle, Mlles, Dr, Drs, Ir, Irs'.
> In hand writing one always uses superscripts for ordinal numbers, which is not
> possible in flat text PC writing and required some fumbling with the mechanical
> typewriter. I.e. '1er', '2ème' or '2e' etc require superscripts, likewise the forms
> derived from the Latin wording '1o, 2o' etc for which one uses the '°'. One also
> often sees Me (maître = master in law) with a superscripted e. Now as to know
> whether '°' is a superscripted 'o' or the degree sign, my keyboard does not tell
> me. I would bet however that the '°' often is smaller than the superscripted e.

Users can thus feel free to follow an already existing practice, by upgrading
either the keyboard layout, the autocorrect list, an AutoHotkey script, or
whatever else they use.
Cf. the Bing search results for "ᵉ":
https://www.bing.com/search?q=%E1%B5%89&PC=U316&FORM=CHROMN
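
On the question in the last quoted mail of whether '°' is a superscript 'o' or
the degree sign: at the code-point level the candidates are distinct
characters. A small sketch using Pythonʼs standard unicodedata module shows
this (the describe helper is a hypothetical name chosen for the example):

```python
import unicodedata

# Three look-alikes that can stand in for a raised 'o'.
CANDIDATES = ["\u00b0", "\u00ba", "\u1d52"]


def describe(ch):
    """Return (code point, character name, NFKC-normalized form)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


for ch in CANDIDATES:
    print(describe(ch))
# ('U+00B0', 'DEGREE SIGN', '°')
# ('U+00BA', 'MASCULINE ORDINAL INDICATOR', 'o')
# ('U+1D52', 'MODIFIER LETTER SMALL O', 'o')
```

Only the degree sign has no compatibility mapping to the letter 'o'; the other
two are treated as superscript forms of 'o' by compatibility normalization.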

Best regards,

Marcel

[1] TUS 9.0, §7.8, p. 327:
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G24762

[2] TUS 9.0, §22.4, p. 786:
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#G42931
Received on Wed Dec 28 2016 - 09:25:48 CST
