Re: Character proposal: LOWER TEN

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jan 18 2008 - 15:01:56 CST

    > On Jan 18, 2008 8:12 AM, Mark Davis <mark.davis@icu-project.org> wrote:
    > > We only encode new characters when there is no way to represent the
    > > characters otherwise in Unicode.
    >
    > To wit: U+2254 ≔ COLON EQUAL and U+2A74 (⩴) DOUBLE COLON EQUAL

    U+2254 COLON EQUAL dates from Unicode 1.0, and got into
    Unicode as part of the initial symbol repertoire, largely
    because it was in the symbol repertoire of XCCS 1980,
    the Xerox corporate character standard,
    which was the primary source for the original math symbols.
    It is encoded as octal 042/124 (i.e. 0x2254) in XCCS 1980,
    and yes, that is an amazing numerical coincidence and was
    not deliberate.
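
    A quick check of that arithmetic (a sketch, assuming the XCCS
    value is read as a character-set byte followed by a code byte,
    with Python used here just as a calculator):

        # Octal 042 is 0x22 and octal 124 is 0x54, so the byte pair
        # reads numerically as 0x2254, which is also the Unicode
        # code point of COLON EQUAL.
        charset, code = 0o042, 0o124
        assert (charset << 8) | code == 0x2254 == ord('\u2254')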

    [Aside for the Unicode trivia buffs: What *other* Unicode
    math symbol (other than ASCII) has the same numerical value
    in Unicode *and* in XCCS 1980?]

    U+2A74 DOUBLE COLON EQUAL got into the standard much later,
    in Unicode 3.2, as part of the very large additional set
    of math symbols requested by AMS and the STIX publication
    group. Everything in the 2AXX block is part of that collection.

    There were several reasons for also encoding it as a unit,
    although I can assure you the UTC members debated that,
    along with the identity of a number of other operators
    and relations that could conceivably be represented by
    sequences of other characters.

    1. The precedent of U+2254 COLON EQUAL.

    2. The fact that AMS and STIX had asked for it as a unit.

    3. The need in computational algebraic systems to have
       operators as single characters where possible.
       
    Note, however, that the UTC *did* end up providing
    a compatibility decomposition for U+2A74 (and U+2A75
    and U+2A76, as well). It would have done the same for
    U+2254, but for the constraints of normalization
    stability.
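
    For illustration, a minimal check of those decompositions with
    Python's unicodedata module (the mappings are the compatibility
    decompositions published in UnicodeData.txt):

        import unicodedata

        # U+2A74..U+2A76 carry compatibility decompositions to ASCII
        # sequences, so NFKC normalization maps them to "::=", "==",
        # and "===".
        assert unicodedata.normalize('NFKC', '\u2A74') == '::='
        assert unicodedata.normalize('NFKC', '\u2A75') == '=='
        assert unicodedata.normalize('NFKC', '\u2A76') == '==='

        # U+2254 COLON EQUAL has no decomposition and survives NFKC
        # unchanged; adding one now would break normalization
        # stability.
        assert unicodedata.normalize('NFKC', '\u2254') == '\u2254'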

    > In a sense they belong to the same class as "lower ten" being an
    > ALGOL character and an ALGOL meta-character.
    > What was the convincing argument for their inclusion?

    The arguments are usually not convincing to people looking
    for logical consistency, and have a large degree of
    historical contingency in them, because of the prior
    history of what they were designed to be compatible
    with.

    Note that the ALGOL assignment operator got in as
    a single encoded character, mostly because of its history
    in the XCCS character encoding, even though in ASCII
    program text it was always represented as a sequence
    of ":" and "=". Most C digraphic and trigraphic operators
    never got in that way (and nobody wants them encoded that
    way): ++, --, <=, >=, !=, ||, &&, *=, +=, <<=, etc., etc.
    "==" is one of the few exceptions (cf. U+2A75), but it
    wasn't added to Unicode to stand for the C equality
    operator; rather, it was added as part of the paradigmatic
    set "=", "==", and "===" in another context altogether.

    >
    > Leo
    >
    > PS. I've changed the proposed character name- as printed, it did not
    > descend as much as the subscript digits would, if at all, as
    > all-capital drum printers and tele-typewriters had very little space
    > under the baseline.

    That's a glyph design issue, and unlikely to be a convincing
    argument for distinguishing a "LOWER TEN" from a "SUBSCRIPT TEN".

    First, you have pointed out that this character's semantics
    is to mark the decimal radix, and that has always notionally
    been a subscript 10 in mathematics and computer science. The fact
    that the all-capital drum printers and teletypes couldn't handle
    true subscripting is a limitation of that technology, and not
    an indication that the GOST 10859 standardizers actually had in mind
    a character whose identity wasn't a subscript 10 but was something
    entirely different.

    Second, many extant fonts covering the Unicode compatibility
    subscript digits deliberately draw them as denominator glyphs,
    rather than as true subscript glyphs that descend below the
    baseline.
    See:

    http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts

    So treating the distinction between "lower" denominator glyphs
    and subscript glyphs as a *character* distinction isn't a
    good idea for the standard.

    > PS. I Started writing a proposal and will post my humble attempt at it
    > (is .odf OK?) shortly.

    The docmeister could answer that. I suspect .pdf would be
    preferable, as that format is already used for posting many
    standards documents by both the UTC and SC2. .odf would probably
    have to be converted.

    --Ken


