Not enough code points (was: Re: 32'nd bit & UTF-8)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 14:27:09 CST


    Hans Aberg continued:

    > >> Besides, even though Unicode has declared to never use more than 21
    > >> bits, in the track record, Unicode has reneged on such promises. It
    > >> might be prudent to knock down a full 32-bit encoding, declaring
    > >> UTF-8/32 to be subsets of that.
    > >
    > > I suppose the "promise" that you are referring to, on which Unicode
    > > "reneged," was the original 16-bit design that was extended with the use
    > > of surrogate pairs.
    >
    > Right.
    >
    > > The difference between finding 65,000 things that need to be encoded and
    > > finding 1.1 million things that need to be encoded is the difference
    > > between night and day.

    Doug Ewell was correct.

    This is another one of those alligator-in-the-sewer urban myths
    that circulate as truth, particularly amongst the Unix community,
    unchecked and apparently uncheckable. Since it is simply an
    article of faith, it seems immune to any demonstration of its
    falsity.
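    For reference, the surrogate-pair mechanism mentioned above is the
    device that extended the original 16-bit design to the full
    1,114,112-code-point space (U+0000..U+10FFFF). A minimal sketch of
    the standard formulas (the function names here are illustrative,
    not from any particular library):

    ```python
    def to_surrogate_pair(cp: int) -> tuple[int, int]:
        """Split a supplementary code point (U+10000..U+10FFFF)
        into a high- and low-surrogate code unit."""
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000              # a 20-bit value
        high = 0xD800 + (v >> 10)     # top 10 bits -> high surrogate
        low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> low surrogate
        return high, low

    def from_surrogate_pair(high: int, low: int) -> int:
        """Recombine a surrogate pair into a code point."""
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # U+1D11E MUSICAL SYMBOL G CLEF encodes as the pair D834 DD1E:
    print(tuple(hex(u) for u in to_surrogate_pair(0x1D11E)))
    ```

    The 1024 x 1024 grid of surrogate pairs adds exactly 1,048,576
    supplementary code points on top of the 65,536 in the BMP.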

    > I can only refer to the development of the Bison parser generator
    > <http://gnu.org>. There, the number of tokens, states, etc. were
    > often limited to 2^15. But it turns out that people want, in view
    > of more powerful computers, to plug in larger and larger grammars.
    > One can then plug in really large machine-generated grammars. This
    > way, one might plug in grammars with millions of tokens, for
    > example. So these lower limits are now being changed.

    This is irrelevant.

    32 bits proved not to be enough for IP addresses. So?

    3 digits + 3 letters proved not to be enough for California vehicle
    license plates. So?

    Citing someone's estimate of the potential growth of some numerical
    mechanism in one context as an underestimate doesn't demonstrate
    that someone else's estimate in some other area is also off. Unless
    you want to make the ridiculous claim that no estimate can ever be
    accurate, and that all numbers are doomed to range forever upwards
    beyond any limit we establish for them.

    > So, as long as these Unicode encodings will only be used for human
    > enumeration characters, 1 million is perhaps well within the boundaries.

    Not perhaps. It *is*.

    See:

    http://www.unicode.org/roadmaps/

    Also, please note that Unicode 4.0 has 878,083 undesignated
    code points in it. The UTC and JTC1/SC2/WG2 between them,
    working flat out, are now managing to add approximately 1000
    encoded characters per year. The trend in the future is likely
    to be *down*, rather than up, because the committees are running
    out of well-documented, non-controversial sets of characters
    for scripts or symbol collections.

    You do the math.
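    Doing that math with the figures cited above (878,083 undesignated
    code points as of Unicode 4.0, roughly 1,000 characters encoded per
    year), a back-of-the-envelope sketch:

    ```python
    # Figures from the message above; the variable names are just
    # illustrative labels for this estimate.
    undesignated = 878_083    # undesignated code points, Unicode 4.0
    rate_per_year = 1_000     # approximate current encoding rate

    years_to_exhaust = undesignated / rate_per_year
    print(f"~{years_to_exhaust:.0f} years at the current rate")
    # -> ~878 years at the current rate
    ```

    Even if the encoding rate were ten times higher, the remaining
    space would still last most of a millennium.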
     
    > But
    > if somebody comes up with some clever machine enunciation, then it might be
    > broken.

    In other words, if someone hijacks the character encoding and tries
    to do something completely inappropriate with it, like storing
    RFIDs, ISBNs, and citizen ID numbers in code points, we might run
    out of numbers? Well, yes. But do you think the character encoding
    committees are going to be stupid enough to do that in your or my
    lifetimes?

    --Ken



    This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 14:27:56 CST