Re: Hexadecimal

From: Doug Ewell
Date: Sat Aug 16 2003 - 16:14:13 EDT

    John Cowan beat me to the punch with some of this, but anyway...

    Pim Blokland <pblokland at planet dot nl> wrote:

    >> Basically, thousands of implementations, for decades now,
    >> have been using ASCII 0x30..0x39, 0x41..0x46, 0x61..0x66 to
    >> implement hexadecimal numbers. That is also specified in
    >> more than a few programming language standards and other
    >> standards. Those characters map to Unicode U+0030..U+0039,
    >> U+0041..U+0046, U+0061..U+0066.
    > That's not a good reason for deciding to not implement something in
    > the future.
    > If everybody thought like that, there would never have been a
    > Unicode.

    If the founding designers of Unicode had tried to disunify the letters A
    through F and a through f in this way, so that converters had to map the
    letter D in "Delta" differently from the D in "U+200D", there would not
    be a Unicode today.
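
    Doug's point about converters can be seen in a quick sketch (mine, not
    from the original post): a byte-to-Unicode converter is context-free
    because each byte maps to exactly one code point, so it never has to
    decide whether a D is "a letter" or "a hex digit".

```python
# Minimal sketch: a Latin-1 (or ASCII) to Unicode converter.
# Each byte maps unconditionally to one code point; 0x44 always
# becomes U+0044, whether it appears in "Delta" or in "U+200D".
def latin1_to_unicode(data: bytes) -> str:
    return "".join(chr(b) for b in data)

assert latin1_to_unicode(b"Delta") == "Delta"
assert latin1_to_unicode(b"U+200D") == "U+200D"
# Both strings contain the very same code point, U+0044:
assert latin1_to_unicode(b"Delta")[0] == latin1_to_unicode(b"U+200D")[5] == "\u0044"
```

    Disunifying A-F for hexadecimal use would make this mapping
    context-sensitive, which is exactly what converters were designed to avoid.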

    > Besides, your example is proof that the implementation can change;
    > has to change. Where applications could use 8-bit characters to
    > store hex digits in the old days, they now have to use 16-bit
    > characters to keep up with Unicode...

    This has nothing to do with creating clones of the letters A-F and a-f
    for use with hexadecimal numbers.

    >> There is also a HUGE semantic difference between D meaning the
    >> letter D
    >> and Roman numeral D meaning 500.
    > and those have different code points! So you're saying Jill is
    > right, right?

    Not exactly. The character U+216E ROMAN NUMERAL FIVE HUNDRED came from
    an East Asian double-byte character set, and was carried over into
    Unicode for round-tripping reasons. It is a compatibility equivalent of
    U+0044 LATIN CAPITAL LETTER D.

    If such a legacy standard had separate characters for the hexadecimal
    digits 10 through 15, we'd probably see them in Unicode for the same
    reason. But none did.
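
    You can see U+216E's compatibility status directly in the character
    database; a quick check with Python's standard unicodedata module:

```python
import unicodedata

# U+216E ROMAN NUMERAL FIVE HUNDRED carries its own numeric value but
# has a compatibility decomposition back to the plain letter D (U+0044).
d500 = "\u216e"
print(unicodedata.name(d500))               # ROMAN NUMERAL FIVE HUNDRED
print(unicodedata.decomposition(d500))      # <compat> 0044
print(unicodedata.numeric(d500))            # 500.0
print(unicodedata.normalize("NFKC", d500))  # D
```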

    > You seem to define "meaning" differently than what we're talking
    > about here.
    > In the abbreviation "mm" the two m's have different meanings: the
    > first is "milli" and the second is "meter". No one is asking to
    > encode those two letters with different codepoints!
    > What we're talking about is different general categories, different
    > numeric values and even, oddly enough, different BiDi categories.
    > Doesn't that qualify for creating new characters?

    You could make a case for proposing numeric values of 10 through 15 to
    be added to U+0041 through U+0046 and U+0061 through U+0066, based on
    their undeniably widespread use as hexadecimal digits. (No, I don't
    want to get into a debate about the word "digit" implying "ten.") But
    the differences in the other categories are less convincing. Latin
    letters have Bidi category L (strong LTR) while European digits are EN
    (weak LTR), but you may have a difficult time finding a non-pathological
    context in which European numerals are legitimately used RTL.
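
    The category differences being discussed are easy to inspect; a short
    Python illustration (mine, not from the original post):

```python
import unicodedata

# Letters vs. decimal digits: different General_Category and different
# Bidi_Class, even though both ranges serve as hexadecimal digits.
for ch in ("A", "a", "7"):
    print(ch, unicodedata.category(ch), unicodedata.bidirectional(ch))
# A Lu L    (uppercase letter, strong LTR)
# a Ll L    (lowercase letter, strong LTR)
# 7 Nd EN   (decimal digit, European Number: weak)

# int() already treats A-F and a-f as the values 10 through 15,
# with no specially-encoded hex characters needed:
assert int("FF", 16) == int("ff", 16) == 255
```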

    John is right. Any proposal to disunify standard, common uses of the
    characters in the Basic Latin block would require unimaginable volumes
    of existing data to be recoded. (Thanks to UTF-8, even the move from
    8-bit character sets to Unicode, which you cited earlier, didn't require
    this.) See the "Decimal Separator" example in the ISO "Principles and
    Procedures" document for how this burden can override other,
    well-meaning motivations to disunify common characters.
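
    The UTF-8 point is worth making concrete: ASCII bytes are their own
    UTF-8 encoding, so legacy hex text needed no recoding at all. A quick
    demonstration (mine, not from the original post):

```python
# ASCII is a byte-for-byte subset of UTF-8, so 8-bit hex data
# carried over into Unicode without any recoding step.
legacy = b"0x3F2A"  # hex text in an ASCII-compatible 8-bit charset
assert legacy.decode("utf-8") == "0x3F2A"
assert "0x3F2A".encode("utf-8") == legacy  # round-trips byte-for-byte
```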

    > On a related note, can anybody tell me why U+212A Kelvin sign was
    > put in the Unicode character set?
    > I have never seen any acknowledgement of this symbol anywhere in the
    > real world. (That is, using U+212A instead of U+004B.)

    Round-trip compatibility with East Asian legacy character sets, so
    nobody could say that data converted to Unicode and back had been
    corrupted.

    > And even the UCD calls it a letter rather than a symbol. I'd expect
    > if it was put in for completeness, to complement the degrees
    > Fahrenheit and degree Celsius, it would have had the same category
    > as those two?

    The "degrees Celsius" and "degrees Fahrenheit" symbols (U+2103 and
    U+2109) are imaged as a degree sign followed by a letter. Neither could
    be considered equivalent to a letter by itself, as U+212A can.
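
    The distinction shows up in the compatibility decompositions; a short
    check with Python's unicodedata (my illustration):

```python
import unicodedata

# U+212A KELVIN SIGN is categorized as a letter and NFKC-folds to the
# ordinary K, while U+2103 and U+2109 fold to TWO characters each:
# a degree sign plus a letter.
assert unicodedata.category("\u212a") == "Lu"
assert unicodedata.normalize("NFKC", "\u212a") == "K"
assert unicodedata.normalize("NFKC", "\u2103") == "\u00b0C"  # degree sign + C
assert unicodedata.normalize("NFKC", "\u2109") == "\u00b0F"  # degree sign + F
```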

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Sat Aug 16 2003 - 16:44:15 EDT