Re: Character proposal: LOWER TEN

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jan 18 2008 - 15:01:56 CST

    > On Jan 18, 2008 8:12 AM, Mark Davis <mark.davis@icu-project.org> wrote:
    > > We only encode new characters when there is no way to represent the
    > > characters otherwise in Unicode.
    >
    > To wit: U+2254 ≔ COLON EQUAL and U+2A74 (⩴) DOUBLE COLON EQUAL

    U+2254 COLON EQUAL dates from Unicode 1.0, and got into
    Unicode as part of the initial symbol repertoire, largely
    because it was in the symbol repertoire of XCCS 1980,
    the Xerox corporate character standard,
    which was the primary source for the original math symbols.
    It is encoded as octal 042/124 (i.e. 0x2254) in XCCS 1980,
    and yes, that is an amazing numerical coincidence and was
    not deliberate.
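
    A quick check of that arithmetic (a sketch, assuming the XCCS
    value is read as a character-set byte followed by a code byte,
    with Python used here just as a calculator):

        # Octal 042 is 0x22 and octal 124 is 0x54, so the byte pair
        # reads numerically as 0x2254, which is also the Unicode
        # code point of COLON EQUAL.
        charset, code = 0o042, 0o124
        assert (charset << 8) | code == 0x2254 == ord('\u2254')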

    [Aside for the Unicode trivia buffs: What *other* Unicode
    math symbol (other than ASCII) has the same numerical value
    in Unicode *and* in XCCS 1980?]

    U+2A74 DOUBLE COLON EQUAL got into the standard much later,
    in Unicode 3.2, as part of the very large additional set
    of math symbols requested by AMS and the STIX publication
    group. Everything in the 2AXX block is part of that collection.

    There were several reasons for also encoding it as a unit,
    although I can assure you the UTC members debated that,
    along with the identity of a number of other operators
    and relations that could conceivably be represented by
    sequences of other characters.

    1. The precedent of U+2254 COLON EQUAL.

    2. The fact that AMS and STIX had asked for it as a unit.

    3. The need in computational algebraic systems to have
       operators as single characters where possible.
       
    Note, however, that the UTC *did* end up providing
    a compatibility decomposition for U+2A74 (and U+2A75
    and U+2A76, as well). It would have done the same for
    U+2254, but for the constraints of normalization
    stability.
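
    For illustration, a minimal check of those decompositions with
    Python's unicodedata module (the mappings are the compatibility
    decompositions published in UnicodeData.txt):

        import unicodedata

        # U+2A74..U+2A76 carry compatibility decompositions to ASCII
        # sequences, so NFKC normalization maps them to "::=", "==",
        # and "===".
        assert unicodedata.normalize('NFKC', '\u2A74') == '::='
        assert unicodedata.normalize('NFKC', '\u2A75') == '=='
        assert unicodedata.normalize('NFKC', '\u2A76') == '==='

        # U+2254 COLON EQUAL has no decomposition and survives NFKC
        # unchanged; adding one now would break normalization
        # stability.
        assert unicodedata.normalize('NFKC', '\u2254') == '\u2254'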

    > In a sense they belong to the same class as "lower ten" being an
    > ALGOL character and an ALGOL meta-character.
    > What was the convincing argument for their inclusion?

    The arguments are usually not convincing to people looking
    for logical consistency, and have a large degree of
    historical contingency in them, because of the prior
    history of what they were designed to be compatible
    with.

    Note that the ALGOL assignment operator got in as
    a single encoded character, mostly because of its history
    in the XCCS character encoding, even though in ASCII
    program text it was always represented as a sequence
    of ":" and "=". Most C digraphic and trigraphic operators
    never got in that way (and nobody wants them encoded that
    way): ++, --, <=, >=, !=, ||, &&, *=, +=, <<=, etc., etc.
    "==" is one of the few exceptions (cf. U+2A75), but it
    wasn't added to Unicode to stand for the C equality
    operator; rather, it was added as part of the paradigmatic
    set "=", "==", and "===" in another context altogether.

    >
    > Leo
    >
    > PS. I've changed the proposed character name- as printed, it did not
    > descend as much as the subscript digits would, if at all, as
    > all-capital drum printers and tele-typewriters had very little space
    > under the baseline.

    That's a glyph design issue, and unlikely to be a convincing
    argument for distinguishing a "LOWER TEN" from a "SUBSCRIPT TEN".

    First, you have pointed out that this character's semantics
    is to mark the decimal radix, and that has always notionally
    been a subscript 10 in mathematics and computer science. The fact
    that the all-capital drum printers and teletypes couldn't handle
    true subscripting is a limitation of that technology, and not
    an indication that the GOST 10859 standardizers actually had in mind
    a character whose identity wasn't a subscript 10 but was something
    entirely different.

    Second, many extant fonts covering the Unicode compatibility
    subscript digits deliberately draw them as denominator glyphs,
    rather than as true subscript glyphs that descend below the
    baseline.
    See:

    http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

    So treating the distinction between "lower" denominator glyphs
    and subscript glyphs as a *character* distinction isn't a
    good idea for the standard.

    > PS. I Started writing a proposal and will post my humble attempt at it
    > (is .odf OK?) shortly.

    The docmeister could answer that. I suspect .pdf would be
    preferable, as that format is already used for posting many
    standards documents by both the UTC and SC2. .odf would probably
    have to be converted.

    --Ken


