Re: Roman Numerals (was Re: Improper grounds for rejection of proposal N2677)

From: Kenneth Whistler
Date: Mon Oct 31 2005 - 17:45:19 CST

Marc Bruguières asked:

> > The convention of using rulings over strings of Latin letters
> > to indicate higher values should be handled by styles, rather
> > than by individual insertion of combining lines over single characters.
> Why? (I can imagine a reason but please explain yourself.)

For the same reason that underscored text should be handled as
a style, instead of as u_n_d_e_r_s_c_o_r_e_d_ text using combining
character sequences.

Diacritic combining marks have (ideally) their scope as a single
base character, which they modify. That would include underscoring
or overscoring a single character -- which is why the standard includes
such combining marks.

But scoring text (including through-scoring) is an example of
a stylistic text decoration that takes an arbitrary span of text
as its scope. Such things are better handled by rendering processes
that include the notion of arbitrary text spanning as part of
their relevant concerns.

The Roman numeral overscoring convention takes a *chunk* of text
and scores it -- that is built into the definition of the convention.
It could be emulated in print by concatenating a bunch of individually
scored letters (using combining marks in a Unicode encoding), but
that is basically using forks to dig holes instead of using shovels.
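
To make the contrast concrete, here is a minimal sketch in Python. It assumes U+0305 COMBINING OVERLINE as the per-character mark, and it uses a made-up dictionary as a stand-in for a styled span; the function names and the "overline" style key are illustrative only and are not taken from any real markup or rendering API.

```python
# U+0305 COMBINING OVERLINE: scope is the single base character before it.
COMBINING_OVERLINE = "\u0305"

def overscore_per_character(numeral: str) -> str:
    """Emulate a ruling by attaching one combining mark to every letter."""
    return "".join(ch + COMBINING_OVERLINE for ch in numeral)

def overscore_as_style(numeral: str) -> dict:
    """Hypothetical styled span: the text is untouched, the ruling is metadata."""
    return {"text": numeral, "style": "overline"}

scored = overscore_per_character("XII")
styled = overscore_as_style("XII")

# Three letters become six code points, one mark per letter.
print(len(scored))
# A plain search for the numeral no longer matches the emulated form...
print("XII" in scored)
# ...while the styled span still carries the original string.
print("XII" in styled["text"])
```

In the emulated form the ruling has to be repeated for every letter, and searching or editing the numeral means working around the interleaved marks. In the styled form the underlying text stays "XII" and the ruling applies to the span as a whole.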

> [Gregg Reynolds reasonably asked]
> >> Arguably, the thousand multiplier has a plain-text meaning that should be
> >> encodable as such.
> [To which Kenneth Whistler proposed this strange answer :]
> >Or..... arguably not.
> Well, argue it then.

I don't see anything particularly strange about that. Perhaps
I should have just said, "I disagree," to avoid attracting further
disagreement. But anyway...

To my mind this is comparable to claiming that "arguably" a
superscript number has a plain-text meaning that should be
encodable as such. And reasoning from that conclusion that arbitrary
spans of superscripted (or subscripted) numbers should then
be represented in terms of distinguished encoded characters,
rather than as superscript or subscript styles.

I think it is clear that the better, more generic representation
of superscript and subscript elements is via styles. And this
is the case, *even* though there are encoded characters for
superscript and subscript digits. Those are compatibility
characters in the first place, and in the second are
*convenience* characters for the one-off usages of occasional
superscripts in text. (...the reason why superscript 1, 2, and 3
were included in ISO/IEC 8859-1, by the way.)
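
The compatibility status of those digits can be checked against the Unicode Character Database. The following is a small sketch using Python's standard unicodedata module; the variable names are only for illustration.

```python
import unicodedata

# Superscript 1, 2 and 3 as inherited from ISO/IEC 8859-1.
latin1_superscripts = "\u00b9\u00b2\u00b3"

for ch in latin1_superscripts:
    # The <super> tag marks a compatibility decomposition to a plain digit.
    print(f"U+{ord(ch):04X}",
          unicodedata.name(ch),
          unicodedata.decomposition(ch))

# Compatibility normalization folds the superscript into an ordinary digit,
# so the "superscriptness" is treated as formatting, not as identity.
folded = unicodedata.normalize("NFKC", "x\u00b2")
print(folded)
```

The `<super>` tag in each decomposition is how the character database records that these are formatted variants of the ordinary digits.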

The existence of a well-defined *concept* of thousands
multiplication doesn't automatically qualify a mark used
conventionally to represent that concept on paper as an
abstract character suitable for encoding as a character in
Unicode. For that, one needs to make the case regarding
what level of text representation (character versus
text-spanning styles) more appropriately fits with the
character encoding model and with implementations.

> Why is an indication of a thousand multiplication less
> worthy of plain text encoding than a macron in other places.

Well, to start off, it isn't a macron, but an overscore.
For a *single* character, as seen in the table Philippe
cited for the French Wikipedia entry, use of combining
overscore is perfectly appropriate as a way to represent
such text elements. For construction of entire, long,
numerical expressions, it is not.

> >Not every semantic distinction carried in written form is appropriate
> >for plain text, nor for encoding as a character.
> Well, perhaps, but why not in general and why not here?
> A thousand multiplier, simply a macron,

It isn't simply a macron.

> a simple enough and clear enough plain text sign looks to me.
> What is your definition of plain text?

See the Unicode Standard, 4.0, p. 18.

> Right now your "arguably not" sounds as arbitrary as
> "those who know when something should be coded in plain text,
> those who do not know, don't know what to code in plain text."

Huh? Something incomplete there.

But the Unicode Standard goes on for hundreds of pages, and
much of that content can be taken, on a script-by-script basis,
as explaining what should (or should not) be represented
in plain text using Unicode characters.

People can and have come on this list claiming, for example,
that text color should be "coded in plain text", despite the
fact that the standard suggests otherwise. In such cases, it
shouldn't be too surprising that people retort, "arguably,
not..." without feeling the need to recapitulate the standard's
discussion of plain text for each email exchange.

