From: Jim Allan (firstname.lastname@example.org)
Date: Sat Aug 16 2003 - 16:13:29 EDT
Pim Blockland posted:
> Kenneth Whistler wrote:
>> Basically, thousands of implementations, for decades now,
>> have been using ASCII 0x30..0x39, 0x41..0x46, 0x61..0x66 to
>> implement hexadecimal numbers. That is also specified in
>> more than a few programming language standards and other
>> standards. Those characters map to Unicode U+0030..U+0039,
>> U+0041..U+0046, U+0061..U+0066.
> That's not a good reason for deciding to not implement something in
> the future.
> If everybody thought like that, there would never have been a
You are taking Ken's statements out of context.
Unicode did not attempt to change all of past practice, but to change
parts of it and build on parts of it balancing the apparent value of the
changes against the disruption they would cause.
You have not provided a reason why the letters used as hex digits should
be encoded separately for that particular use when they would make *no*
difference in display.
Unicode encodes characters, not meanings, with a very few exceptions,
most of them for compatibility reasons and a few for word division reasons.
> Besides, your example is proof that the implementation can change;
> has to change. Where applications could use 8-bit characters to
> store hex digits in the old days, they now have to use 16-bit
> characters to keep up with Unicode...
Are you actually arguing that because change happens, therefore any
particular proposed change must be beneficial?
In any case, applications using UTF-8 still use one byte per hex digit
(and per decimal digit). Double-byte character sets were already
using two bytes for the hex digits. (Mixed-byte character sets were not.)
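A minimal Python sketch of that point: the ASCII hex digits fall in the U+0000..U+007F range, so each one still occupies a single byte in UTF-8.

```python
# Every ASCII hex digit encodes to exactly one byte in UTF-8,
# because UTF-8 is identical to ASCII for U+0000..U+007F.
hex_digits = "0123456789ABCDEFabcdef"

for ch in hex_digits:
    assert len(ch.encode("utf-8")) == 1

# A two-digit hex number is still two bytes, just as in ASCII.
assert len("2F".encode("utf-8")) == 2
```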
> and Jim Allen wrote:
>> > What I mean is, it seems (to me) that there is a HUGE semantic
>> > difference between the hexadecimal digit thirteen, and the letter D.
>> There is also a HUGE semantic difference between D meaning the
>> letter D
>> and Roman numeral D meaning 500.
> and those have different code points! So you're saying Jill is
> right, right?
You are quoting out of context from an explanation as to why Unicode
coded Roman numerals separately. See 14.3:
<< Number form characters are encoded solely for compatibility with
existing standards. >>
<< Roman Numerals. The Roman numerals can be composed of sequences of
the appropriate Latin letters. Upper- and lowercase variants of the
Roman numerals through 12, plus L, C, D, and M, have been encoded for
compatibility with East Asian standards. >>
These were not encoded because the Unicode people thought they would be
at all useful. They aren't at all useful.
Most fonts don't support those characters, and probably most fonts never will.
There is normally no reason to use them, unless you want to spoof people
and cause difficulties in searches and have missing character glyphs or
glyphs from another font in a different style from the main font appear
when font changes are made.
_D_ in Roman numerals is still the character _D_. People knew it was _D_
when they wrote it and knew it was _D_ when they hand set type. They
typed the _D_ key on typewriters. They typed the _D_ key on computer
keyboards. And in Unicode they will mostly enter standard U+0044 LATIN
CAPITAL LETTER D, quite rightly, despite a needless alternate Roman
numeral _D_ in some few fonts.
Similarly they know that _D_ in hex notation is the letter _D_ given a
special meaning in that context. Coding separately two meanings of the
same character would not be helpful.
People make enough errors in entering characters even when they can see
the difference.
> You seem to define "meaning" differently than what we're talking
> about here.
> In the abbreviation "mm" the two m's have different meanings: the
> first is "milli" and the second is "meter". No one is asking to
> encode those two letters with different codepoints!
It is the same kind of difference.
It is still _m_, just with a different meaning, just as the Greek
character _pi_ used in geometry for the ratio of a circle's
circumference to its diameter is still the character _pi_, the same as
_c_ used for the speed of light in "E=mc²" is still the character _c_.
Should particular semantic meanings of characters be encoded
separately just because they are arithmetical or mathematical? The
distinction in use appears in the context of the usage. Encoding a new
character with the same appearance would indicate nothing extra.
Computers can perform mathematics with Roman numerals or hex numbers
perfectly well when they know they are Roman numerals or hex numbers
without any special encoding.
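As a small illustrative sketch of that point in Python (the ROMAN lookup table here is just an ad-hoc helper, not anything from Unicode): the same letter "D" yields different values once the program is told which convention applies.

```python
# The value of "D" depends entirely on the declared context,
# not on which code point is used to store the letter.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

as_hex = int("D", 16)   # "D" read as a hexadecimal digit: 13
as_roman = ROMAN["D"]   # "D" read as a Roman numeral: 500

assert as_hex == 13
assert as_roman == 500
```

Both interpretations start from the ordinary letter U+0044; no duplicate code point is needed to do the arithmetic.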
Anyone at any time in any discipline can assign a special meaning to a
Latin letter without waiting for this meaning to be encoded in Unicode
and should not expect that a clone of the character with that special
meaning would ever be encoded in Unicode.
> What we're talking about is different general categories, different
> numeric values and even, oddly enough, different BiDi categories.
> Doesn't that qualify for creating new characters?
Not unless it would be *useful*. The Greek and Hebrew letters have
numeric values also. Would it be useful to encode them all twice for
that reason alone?
In fact we *know* that when used for numeric values they still are the
*same* characters with different semantics. Unicode encodes characters.
What benefit would there be in encoding a character twice, when current
usage seldom bothers or confuses anyone?
One might better encode decimal point period, decimal point comma
separate from normal period and normal comma. One might better also
encode abbreviation period separately from sentence-ending period. We
could code right apostrophe separate from single high closing quotation
mark. But Unicode doesn't.
The fact that in an orthographic system certain symbols have multiple
and inconsistent semantics is a fault of the system, not of the encoding.
Change the system (say, by demanding that every hex digit have a dot over
it or that sentences end with a hollow circle) and then Unicode will have
to follow suit. But as it is now, Unicode adequately encodes the
orthographic system in use.
And in general it is for computer systems to make things easy for the
users, not more difficult by demanding the users enter symbols for
particular use that make no difference whatsoever in print or on a
screen (unless one views it in special mode).
If a programming language needs a way to distinguish 25 hex from 25
decimal, it should be by a method that humans can also see. Note, as
this example shows, that you would have to add duplicates not only of
some letters of the alphabet but also of the numeric digits. And you
would presumably have to do this again for octal use, since 10
octal is 8 decimal. Then there is binary, such as 10010.
And what about base 20 if we want to count in scores?
You will need a separate set of characters for every base you want to
encode. And you still won't be able to tell them apart by looking at them.
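This is exactly how programming languages already handle it, with a visible prefix or an explicit radix; a short Python sketch:

```python
# A visible prefix (or explicit radix) tells both the human reader and
# the parser which base applies -- no duplicate characters required.
values = {
    "hex":     0x25,            # 37
    "decimal": 25,              # 25
    "octal":   0o25,            # 21
    "binary":  0b10010,         # 18
    "base20":  int("25", 20),   # 45, via an explicit radix argument
}

assert values["hex"] == 37
assert values["octal"] == 21
assert values["binary"] == 18
assert values["base20"] == 45
```

The digits and letters in each literal are the same plain characters; only the visible prefix or radix changes their interpretation.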
> On a related note, can anybody tell me why U+212A Kelvin sign was
> put in the Unicode character set?
> I have never seen any acknowledgement of this symbol anywhere in the
> real world. (That is, using U+212A instead of U+004B.)
> And even the UCD calls it a letter rather than a symbol. I'd expect
> if it was put in for completeness, to complement the degrees
> Fahrenheit and degree Celsius, it would have had the same category
> as those two?
U+212A comes from KS C 5601 standard encoding for Korean and IBM code
page 944 for Korean and possibly for some other old East Asian standard(s).
It appears to result from someone blindly including it as a Roman-letter
technical abbreviation in the Korean character set, even though that set
already had the entire standard 26-character Roman alphabet. So Unicode
is stuck with it for compatibility.
But Unicode assigns U+212A a canonical decomposition to normal U+004B K.
That means U+212A is considered to be a duplicate of normal U+004B K.
See the conformance requirements in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf notably C9 and
C10. Applications can silently replace it with U+004B and must not
assume that another application will not silently replace it with U+004B.
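The singleton canonical decomposition is easy to verify with Python's standard `unicodedata` module: every normalization form maps the Kelvin sign to the plain letter K.

```python
import unicodedata

kelvin = "\u212A"  # KELVIN SIGN

# U+212A has a singleton canonical decomposition to U+004B, and
# singletons are excluded from recomposition, so all four
# normalization forms yield the ordinary capital K.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, kelvin) == "K"
```

Any normalizing process is therefore entitled to discard the distinction, which is why data round-tripped through Unicode cannot rely on U+212A surviving.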
I see no point in ever using U+212A, except for spoofing, or for
retaining data exactly as encoded when it has been converted from a code
page that uses this character, so that it can be converted back properly,
so that validation checksums and the like remain valid, or so that some
non-standard glyph for this character in a particular font will display
properly.
The character U+212A within Unicode is useless.
Maybe it is time to deprecate some of these characters.
This archive was generated by hypermail 2.1.5 : Sat Aug 16 2003 - 16:44:54 EDT