Not enough code points (was: Re: 32'nd bit & UTF-8)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 14:27:09 CST


    Hans Aberg continued:

    > >> Besides, even though Unicode has declared to never use more than 21
    > >> bits, in the track record, Unicode has reneged on such promises. It
    > >> might be prudent to knock down a full 32-bit encoding, declaring
    > >> UTF-8/32 to be subsets of that.
    > >
    > > I suppose the "promise" that you are referring to, on which Unicode
    > > "reneged," was the original 16-bit design that was extended with the use
    > > of surrogate pairs.
    >
    > Right.
    >
    > > The difference between finding 65,000 things that need to be encoded and
    > > finding 1.1 million things that need to be encoded is the difference
    > > between night and day.

    Doug Ewell was correct.

    This is another one of those alligator-in-the-sewer urban myths
    that circulate as truth, particularly amongst the Unix community,
    unchecked and apparently uncheckable. Since it is simply an
    article of faith, it seems immune to any demonstration of its
    falsity.
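    For reference, the surrogate-pair mechanism mentioned above is the
    device that extended the original 16-bit design to the full
    1,114,112-code-point space (U+0000..U+10FFFF). A minimal sketch of
    the standard formulas (the function names here are illustrative,
    not from any particular library):

    ```python
    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Split a supplementary code point (U+10000..U+10FFFF)
        into a high- and low-surrogate code unit."""
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000              # a 20-bit value
        high = 0xD800 + (v >> 10)     # top 10 bits -> high surrogate
        low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> low surrogate
        return high, low

    def from_surrogate_pair(high: int, low: int) -> int:
        """Recombine a surrogate pair into a code point."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # U+1D11E MUSICAL SYMBOL G CLEF encodes as the pair D834 DD1E:
    print(tuple(hex(u) for u in to_surrogate_pair(0x1D11E)))
    ```

    The 1024 x 1024 grid of surrogate pairs adds exactly 1,048,576
    supplementary code points on top of the 65,536 in the BMP.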

    > I can only refer to the development of the Bison parser generator
    > <http://gnu.org>. There, the number of tokens, states, etc. were
    > often limited to 2^15. But it turns out that people want, in view
    > of more powerful computers, to plug in larger and larger grammars.
    > One can then plug in really large machine-generated grammars. This
    > way, one might plug in grammars with millions of tokens, for
    > example. So these lower limits are now being changed.

    This is irrelevant.

    32 bits proved not to be enough for IP addresses. So?

    3 digits + 3 letters proved not to be enough for California vehicle
    license plates. So?

    Citing someone's estimate of the potential growth of some numerical
    mechanism in one context as an underestimate doesn't demonstrate
    that someone else's estimate in some other area is also off. Unless
    you want to make the ridiculous claim that no estimate can ever be
    accurate, and that all numbers are doomed to range forever upwards
    beyond any limit we establish for them.

    > So, as long as these Unicode encodings will only be used for human
    > enumeration characters, 1 million is perhaps well within the boundaries.

    Not perhaps. It *is*.

    See:

    http://www.unicode.org/roadmaps/

    Also, please note that Unicode 4.0 has 878,083 undesignated
    code points in it. The UTC and JTC1/SC2/WG2 between them,
    working flat out, are now managing to add approximately 1000
    encoded characters per year. The trend in the future is likely
    to be *down*, rather than up, because the committees are running
    out of well-documented, non-controversial sets of characters
    for scripts or symbol collections.

    You do the math.
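    Doing that math with the figures cited above (878,083 undesignated
    code points as of Unicode 4.0, roughly 1,000 characters encoded per
    year), a back-of-the-envelope sketch:

    ```python
    # Figures from the message above; the variable names are just
    # illustrative labels for this estimate.
    undesignated = 878_083    # undesignated code points, Unicode 4.0
    rate_per_year = 1_000     # approximate current encoding rate

    years_to_exhaust = undesignated / rate_per_year
    print(f"~{years_to_exhaust:.0f} years at the current rate")
    # -> ~878 years at the current rate
    ```

    Even if the encoding rate were ten times higher, the remaining
    space would still last most of a millennium.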
     
    > But
    > if somebody comes up with some clever machine enunciation, then it might be
    > broken.

    In other words, if someone hijacks the character encoding and tries
    to do something completely inappropriate with it, like storing
    RFIDs, ISBNs, and citizen ID numbers in code points, we might run
    out of numbers? Well, yes. But do you think the character encoding
    committees are going to be stupid enough to do that in your or my
    lifetimes?

    --Ken



    This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 14:27:56 CST