Re: Hexadecimal

From: Jim Allan (jallan@smrtytrek.com)
Date: Sat Aug 16 2003 - 16:13:29 EDT

    Pim Blockland posted:

    > Kenneth Whistler wrote:
    >
    >> Basically, thousands of implementations, for decades now,
    >> have been using ASCII 0x30..0x39, 0x41..0x46, 0x61..0x66 to
    >> implement hexadecimal numbers. That is also specified in
    >> more than a few programming language standards and other
    >> standards. Those characters map to Unicode U+0030..U+0039,
    >> U+0041..U+0046, U+0061..U+0066.
    >
    > That's not a good reason for deciding to not implement something in
    > the future.
    > If everybody thought like that, there would never have been a
    > Unicode.

    You are taking Ken's statements out of context.

    Unicode did not attempt to change all of past practice, but to change
    parts of it and build on parts of it, balancing the apparent value of
    the changes against the disruption they would cause.

    You have not provided a reason why the letters used as hex digits should
    be encoded separately for that particular use when they would make *no*
    difference in display.

    Unicode encodes characters, not meanings, with a very few exceptions,
    most of them for compatibility reasons and a few for word division reasons.

    > Besides, your example is proof that the implementation can change;
    > has to change. Where applications could use 8-bit characters to
    > store hex digits in the old days, they now have to use 16-bit
    > characters to keep up with Unicode...

    Are you actually arguing that because change happens, therefore any
    particular proposed change must be beneficial?

    In any case applications still use one byte per character for hex digits
    (and decimal digits) if using UTF-8. Double-byte character sets were
    already using two bytes for the hex digits. (Mixed-byte character sets
    were not.)

    > and Jim Allen wrote:
    >> > What I mean is, it seems (to me) that there is a HUGE semantic
    >> > difference between the hexadecimal digit thirteen, and the letter D.
    >>
    >> There is also a HUGE semantic difference between D meaning the
    >> letter D and Roman numeral D meaning 500.
    >
    > and those have different code points! So you're saying Jill is
    > right, right?

    No.

    You are quoting out of context from an explanation as to why Unicode
    coded Roman numerals separately. See 14.3 at
    http://www.unicode.org/versions/Unicode4.0.0/ch14.pdf:

    << Number form characters are encoded solely for compatibility with
    existing standards. >>

    Also

    << Roman Numerals. The Roman numerals can be composed of sequences of
    the appropriate Latin letters. Upper- and lowercase variants of the
    Roman numerals through 12, plus L, C, D, and M, have been encoded for
    compatibility with East Asian standards. >>

    These were not encoded because the Unicode people thought they would be
    at all useful. They aren't at all useful.

    Most fonts don't support those characters and probably most fonts never
    will.

    There is normally no reason to use them, unless you want to spoof people
    and cause difficulties in searches and have missing character glyphs or
    glyphs from another font in a different style from the main font appear
    when font changes are made.

    _D_ in Roman numerals is still the character _D_. People knew it was _D_
    when they wrote it and knew it was _D_ when they hand set type. They
    typed the _D_ key on typewriters. They typed the _D_ key on computer
    keyboards. And in Unicode they will mostly enter standard U+0044 LATIN
    CAPITAL LETTER D, quite rightly, despite a needless alternate Roman
    numeral _D_ in a few fonts.

    Similarly they know that _D_ in hex notation is the letter _D_ given a
    special meaning in that context. Coding separately two meanings of the
    same character would not be helpful.

    People make enough errors in entering characters even when they can see
    a difference.

    > You seem to define "meaning" differently than what we're talking
    > about here.
    > In the abbreviation "mm" the two m's have different meanings: the
    > first is "milli" and the second is "meter". No one is asking to
    > encode those two letters with different codepoints!

    Why not?

    It is the same kind of difference.

    It is still _m_, just with a different meaning, just as the Greek
    character _pi_ used in geometry for the relationship between a diameter
    and circumference is still the character _pi_, and just as the _c_ used
    for the speed of light in "E=mc²" is still the character _c_.

    Should particular semantic meanings of characters all be encoded
    separately just because they are arithmetical or mathematical? The
    distinction in use appears in the context of the usage. Encoding a new
    character with the same appearance would indicate nothing extra.

    Computers can perform mathematics with Roman numerals or hex numbers
    perfectly well when they know they are Roman numerals or hex numbers
    without any special encoding.
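
    For example (a rough Python sketch; the roman_to_int helper is my own
    illustration, not anything standard), the ordinary letters are all a
    program needs once the context says "hex" or "Roman numeral":

        # Values of the ordinary Latin letters when read as Roman numerals.
        ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

        def roman_to_int(numeral: str) -> int:
            total = 0
            for i, ch in enumerate(numeral.upper()):
                value = ROMAN_VALUES[ch]
                # Subtractive notation: a smaller value before a larger one counts negative.
                if i + 1 < len(numeral) and ROMAN_VALUES[numeral[i + 1].upper()] > value:
                    total -= value
                else:
                    total += value
            return total

        print(int("2D", 16))       # 45   -- the letter D read as a hex digit
        print(roman_to_int("MD"))  # 1500 -- the same letter D read as a Roman numeral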

    Anyone at any time in any discipline can assign a special meaning to a
    Latin letter without waiting for this meaning to be encoded in Unicode
    and should not expect that a clone of the character with that special
    meaning would ever be encoded in Unicode.

    > What we're talking about is different general categories, different
    > numeric values and even, oddly enough, different BiDi categories.
    > Doesn't that qualify for creating new characters?

    Not unless it would be *useful*. The Greek and Hebrew letters have
    numeric values also. Would it be useful to encode them all twice for
    that reason alone?

    In fact we *know* that when used for numeric values they still are the
    *same* characters with different semantics. Unicode encodes characters.

    What benefit is there in encoding a character twice when current usage
    seldom bothers or confuses anyone?

    One might better encode the decimal-point period and the decimal-point
    comma separately from the normal period and comma. One might better also
    encode the abbreviation period separately from the sentence-ending
    period. We could code the right apostrophe separately from the single
    high closing quotation mark.

    But Unicode doesn't.

    The fact that in an orthographic system certain symbols have multiple
    and inconsistent semantics is a fault of the system, not the encoding.
    Change the system (say by demanding that every hex digit have a dot over
    it or that sentences end with a hollow circle) and then Unicode will
    have to follow suit. But as it is now Unicode adequately codes the
    orthographic system in use.

    And in general it is for computer systems to make things easy for the
    users, not more difficult by demanding that users enter symbols for a
    particular use that make no difference whatsoever in print or on a
    screen (unless one views the text in a special mode).

    If a programming language needs a way to distinguish 25 hex from 25
    decimal, it should be by a method that humans can also see. Note, as
    this example shows, not only would you have to add duplicates for some
    letters of the alphabet, but also for the numeric digits. And you would
    presumably have to do this again for the digits used in octal, since 10
    octal is 8 decimal. Then there is binary, such as 10010.

    And what about base 20 if we want to count in scores?

    You will need a separate set of characters for every base you want to
    encode. And you still won't be able to tell them apart by looking at them.
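
    Programming languages already solve this with notation that humans can
    see, not with duplicate characters. A rough sketch using Python's
    literal syntax (the base-20 conversion is only an illustration):

        print(25)             # 25 -- decimal
        print(0x25)           # 37 -- the same two characters read as hexadecimal
        print(0o31)           # 25 -- octal notation for decimal 25
        print(0b11001)        # 25 -- binary notation for decimal 25
        print(int("25", 20))  # 45 -- even base 20 needs only context, not new characters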

    > On a related note, can anybody tell me why U+212A Kelvin sign was
    > put in the Unicode character set?
    > I have never seen any acknowledgement of this symbol anywhere in the
    > real world. (That is, using U+212A instead of U+004B.)
    > And even the UCD calls it a letter rather than a symbol. I'd expect
    > if it was put in for completeness, to complement the degrees
    > Fahrenheit and degree Celcius, it would have had the same category
    > as those two?

    U+212A comes from KS C 5601 standard encoding for Korean and IBM code
    page 944 for Korean and possibly for some other old East Asian standard(s).

    It appears to result from someone blindly including it among the Roman
    letter technical abbreviations in the Korean character set even though
    that set already had the entire standard 26-character Roman alphabet. So
    Unicode is stuck with it for compatibility.

    But Unicode assigns U+212A a canonical decomposition to normal U+004B K.
    That means U+212A is considered to be a duplicate of normal U+004B K.
    See the conformance requirements in
    http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf notably C9 and
    C10. Applications can silently replace it with U+004B and must not
    assume that another application will not silently replace it with U+004B.
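
    To see that equivalence in practice (a rough Python sketch using the
    standard unicodedata module):

        import unicodedata

        kelvin = "\u212A"                                   # KELVIN SIGN
        print(unicodedata.decomposition(kelvin))            # '004B' -- a canonical singleton
        print(unicodedata.normalize("NFC", kelvin))         # 'K'
        print(unicodedata.normalize("NFC", kelvin) == "K")  # True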

    I see no point in ever using U+212A, except for spoofing, or except to
    retain data exactly as encoded when it has been converted from a code
    page that uses this character, so that it can be converted back
    properly, so that any validation checksums and such will still be valid,
    or so that some non-standard value for this character in a particular
    font will display properly.

    The character U+212A within Unicode is useless.

    Maybe it is time to deprecate some of these characters.

    Jim Allan


