Re: French Superscript Abbreviations Fit Plain Text Requirements from Marcel Schneider on 2016-12-29 (Unicode Mail List Archive)

From: Marcel Schneider <charupdate_at_orange.fr>
Date: Thu, 29 Dec 2016 22:20:24 +0100 (CET)

Thank you for your answers and advice. Some points, however, remain
unclear to me.

On Wed, 28 Dec 2016 13:47:00 -0800, Asmus Freytag wrote:
>
> On 12/28/2016 7:25 AM, Marcel Schneider wrote:
> >
> > Applied to the French abbreviation of “numéros” (numbers), that means that the
> > abbreviationʼs final letters 'os' *must not* be formatted as superscript: Since
> > “the extra information in rich text can always be stripped away to reveal the
> > ‘pure’ text underneath” (TUS, ibid.), 'n^{os}' would end up as 'nos' (“our”,
> > with a plural noun). Consequently, best practice is to represent it using the
> > Unicode superscript “modifier letters”: 'nᵒˢ'.
>
> This is seriously overstating the plain text principle.
>
> There are many places where formatting affects the reading (and not just
> the presentation) of the text. In some cases, it is appropriate to encode
> characters for that, in other places the conclusion is simply that plain
> text is not sufficient.
>
> In English, superscript is used for ordinal numbers. The fallback without
> superscript tends to be functional, because of the alternation between
> digits and letters, but there's nothing "pure" about it.
>
> Some sentences in English can be parsed ambiguously; the convention in
> print has been to italicize the word intended to take the stress. Here, the
> plain-text fallback is less functional, as it re-introduces the ambiguity.
>
> There is no rule that says that *all* content information *must* be
> expressible on the plain text level. Some edge cases exist, where other
> layers, by necessity, participate.
>
> Mathematical notation is a good example of such a mixed case: while
> ordinary variables can be expressed in plain text with the help of
> mathematical alphabets, the proper display of formulas requires markup.
> Even Murray Sargent's plain text math is markup, albeit a very clever one
> that re-uses conventions used for the inline presentation of mathematical
> expression. (Where that is insufficient, it introduces additional
> conventions, clearly extraneous to the content, and hence markup).
>
> The encoding conventions (principles) chosen by Unicode stipulate that for
> ordinary text (not notations) any part of the content that requires
> alternate presentation (italics, superscript, etc) is to be supplied via
> styles, not coded characters. In contrast, for scholarly or technical
> notation, that requirement is relaxed.
>
> As long as French is ordinary text, the abbreviations require styled (rich)
> text.

I see that this makes for a much more streamlined implementation, because of
the thousands of decorative fonts that donʼt supply the modifier letters. So
my “*must not*” was too harsh. On the other hand, I see an issue about whether
to stick with legacy practice, or to allow the user to choose an alternate way.

According to TUS (9.0, §22.4, p. 786), vertical alignment in '1^{st}' and in
'DC00_{16}' is to be handled with markup. In the latter case, this Unicode
recommendation leads to content corruption when the related markup is stripped
off. That may occur sooner than expected, e.g. in Word (2010) when a character
style is applied. In the former case, if there's nothing "pure" about '1st' and
the other plain ASCII fallbacks for English ordinals, the current Unicode
recommendation can hardly be the last word here either. Having to distinguish
mathematical notation (to which the base of the numeral system seems to be
considered to belong) from technical notation may also add to the problem.
Writing '1ˢᵗ' and 'DC00₁₆' could be a way to solve it.
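
To make the difference concrete, here is a minimal Python sketch. It is only an
illustration, not part of any Unicode recommendation. It assumes that the ad hoc
'^{...}' and '_{...}' notation used in this message stands in for rich-text
styling; the function names and the small mapping tables are hypothetical and
cover only the characters of the examples above.

```python
import re

# Hypothetical tables: only the characters needed for the examples.
SUPERSCRIPT = {"o": "\u1d52", "s": "\u02e2", "t": "\u1d57"}  # ᵒ ˢ ᵗ
SUBSCRIPT = {"1": "\u2081", "6": "\u2086"}                   # ₁ ₆

# Matches '^{...}' or '_{...}' (the stand-in for styled text).
MARKUP = re.compile(r"([_^])\{([^{}]*)\}")


def strip_markup(text):
    """Drop the styling and keep the bare characters."""
    return MARKUP.sub(lambda m: m.group(2), text)


def to_plain_unicode(text):
    """Replace styled runs with encoded superscript/subscript characters.

    Characters missing from the tables are left on the baseline.
    """
    def convert(m):
        table = SUPERSCRIPT if m.group(1) == "^" else SUBSCRIPT
        return "".join(table.get(ch, ch) for ch in m.group(2))
    return MARKUP.sub(convert, text)


if __name__ == "__main__":
    for sample in ("n^{os}", "1^{st}", "DC00_{16}"):
        print(sample, "|", strip_markup(sample), "|", to_plain_unicode(sample))
```

With the styling stripped, 'DC00_{16}' collapses to 'DC0016' and 'n^{os}' to
'nos', whereas the output of to_plain_unicode() has no styling left to lose.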

Another—admittedly much more straightforward—way to solve the problem is to stick
with baseline letters and punctuation. German and French may denote stress with
titlecase (“Nur Eine mögliche Lösung” [‘Only one possible solution’], sometimes
considered obsolete; “À la Une” [‘On the cover page’], current French), while
Dutch uses the (combining) acute accent (subject of a recent thread).
As for italics, they can be avoided in English and French if the sentence is
worded differently (as when “Superscript /can/ be used in abbreviations, but in
some languages this is not mandatory” becomes “Albeit superscript can be used […]”).
In fact, in Spanish there seems to be a move from superscript to baseline
letters in abbreviations, so that “Señor” and “Señora” shorten to “Sr.” and “Sra.”,
in preference to the (obsolete) “S.^r” and “S.^a” (the latter sometimes written using
the feminine ordinal indicator: “Nª Sª” [‘Our Lady’]).

On Thu, 29 Dec 2016 09:35:54 +0100, Philippe Verdy wrote:
>
> I agree. Even for the abbreviation "Nos" or "nos", there's no ambiguity
> due to the grammar: in a sentence the abbreviation would be preceded by an
> article ("les nos 2 et 3") or a noun ("les articles nos 2 et 3") and
> followed by numerals, and this cannot be analyzed like the possessive "nos",
> which cannot appear after an article or noun.

You are quite right, my demonstration was too abstract. “Nos nos 2 et 3”
[‘Our nos. 2 and 3’] would be a bit confusing, though.

>
> If you want to represent only plaintext the typographic superscripts could
> still be replaced by inserting an abbreviation dot ("les n.os 2 et 3") or by
> not abbreviating it at all ("les numéros 2 et 3"). These superscripts are
> presentational only. The same applies to other abbreviations such as "Mgr"
> ("Monseigneur", which can be typeset as "M^{gr}"), "Bd"
> ("Boulevard", typeset as "B^{d}"), "Mlle" ("Mademoiselle", typeset
> as "M^{lle}") and many, many abbreviations suffixing the last
> letters of a word that are preferably typeset using superscripts, but that
> are still normal Latin letters, including letters with accents (notably "é"
> which is frequent at end of French participles or nouns and which has no
> encoded superscript variant).

The idea is mainly that if one is bound to plain text and nevertheless wants to
follow high-end presentation rules as a mark of respect, then it would be a pity
not to use the means that are already available in Unicode and current fonts.

Another advantage of the use of modifier letters is stability. A source text that
has these superscripts hard-coded can either be used as-is, or be parsed so that
the modifier letters are replaced with styled baseline letters.
Precomposed superscripts are not needed (and will never be encoded), given that
future practice may direct font design to support combining diacritics herein.
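
The second option, parsing a hard-coded source and replacing the modifier
letters with styled baseline letters, could look roughly like the following
Python sketch. It is an illustration under stated assumptions: the '^{...}'
notation again stands in for real rich-text styling, and the function name and
the three-letter table are hypothetical.

```python
import re

# Hypothetical table: modifier letter -> baseline letter (examples only).
MODIFIER_TO_BASE = {"\u1d52": "o", "\u02e2": "s", "\u1d57": "t"}  # ᵒ ˢ ᵗ

# One or more consecutive modifier letters from the table.
MODIFIER_RUN = re.compile("[" + re.escape("".join(MODIFIER_TO_BASE)) + "]+")


def modifiers_to_markup(text):
    """Turn each run of modifier letters into one styled run '^{...}'."""
    def convert(m):
        base = "".join(MODIFIER_TO_BASE[ch] for ch in m.group(0))
        return "^{" + base + "}"
    return MODIFIER_RUN.sub(convert, text)


if __name__ == "__main__":
    print(modifiers_to_markup("les n\u1d52\u02e2 2 et 3"))
```

Because the superscript information is carried by the characters themselves,
this conversion needs no knowledge of French: it does not have to guess whether
a baseline "nos" is the abbreviation or the possessive.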

>
> Adding superscript variants (or other typographic variants) in Unicode for
> that use would mean reencoding thousands of letters in many scripts and in a
> dozen stylistic variants. This is not the way to go.

Sure. However, I donʼt know how many languages and scripts actually use
superscripts the way a few Latin-script languages do. My feeling is that
there are none, but I may well be wrong.

As for other stylistic variants, the mathematical alphabets are in fact used
outside mathematics. Google Search is already able to handle them as if they were
plain ASCII.
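
How Google does this is not known to me. One standard mechanism that gives the
same result is Unicode compatibility normalization (NFKC), shown in the Python
sketch below using the standard unicodedata module. The function name is
hypothetical, and the use of NFKC here is an assumption for illustration only.

```python
import unicodedata


def fold_compat(text):
    """Apply NFKC, which maps compatibility variants to their base characters."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # MATHEMATICAL BOLD SMALL N, O, S (U+1D427, U+1D428, U+1D42C)
    bold = "\U0001d427\U0001d428\U0001d42c"
    # LATIN SMALL LETTER N followed by MODIFIER LETTER SMALL O and SMALL S
    abbreviation = "n\u1d52\u02e2"
    print(fold_compat(bold), fold_compat(abbreviation))
```

Both strings fold to plain "nos". The same folding that makes mathematical
letters searchable thus also erases the distinction that the modifier letters
were meant to preserve, so it is suited to matching rather than to storage.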

>
> Plain text documents have their constraints, if clarity is needed they are
> necessarily modified with additional text, but converting a rich text to
> plain text and dropping all styles is destructive and may cause ambiguity
> in some rare cases. But language semantics and grammar most often resolve
> them to give sense to that text and abbreviations in plain text will still
> be readable in most cases.

That is comforting.

Marcel
Received on Thu Dec 29 2016 - 15:20:50 CST
