From: Asmus Freytag (firstname.lastname@example.org)
Date: Fri Nov 02 2007 - 03:28:59 CST
On 11/2/2007 1:48 AM, Jeroen Ruigrok van der Werven wrote:
> -On [20071101 22:46], Kenneth Whistler (email@example.com) wrote:
>> Once again, just in time for the holidays, the Unicode list
>> has come around again to one of its perennial favorite topics:
>> how 17 planes isn't enough codespace, how software will
>> break when we "inevitably" run out of codes for characters,
>> and what a shame it is to be stuck with such a limited
>> and architecturally flawed construct, given all the 30 bezillion
>> unencoded characters waiting to be encoded.
> One thing I am interested in though is the following:
> we know that in the current implementation of Unicode we have a lot of cruft
> resulting from wrong data, typos and whatnot that never gets removed due to
> the nature/charter of Unicode.
They don't ever get removed because of the fundamental nature of
character encodings: a character encoding is your key to interpreting
data encoded with that encoding. Or, in the case of a large character
encoding like Unicode, it's perhaps better to think of it as a whole
humongous ring of master keys.
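To make the 'key' metaphor concrete, here is a quick, purely
illustrative Python sketch of my own (nothing normative about it): the
same bytes read completely differently depending on which encoding you
use to unlock them, and once you no longer know which key applies, the
data can't be reliably deciphered.

  # The same byte sequence, interpreted with two different "keys":
  data = b'\xc3\xa9t\xc3\xa9'
  print(data.decode('utf-8'))    # -> 'été'   (the intended reading)
  print(data.decode('latin-1'))  # -> 'Ã©tÃ©' (a legal but wrong reading)
  # Lose track of which encoding produced the bytes, and the data is,
  # for practical purposes, undecipherable.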
> Ultimately the desire will arise to take what we know is (mostly) correct in
> Unicode, clear all the unwanted cruft and start fresh from that point.
I'm not so sure. By throwing away some of your master keys, you make
some data (permanently) undecipherable. You'd have to be extremely sure
a) that no data you are interested in contains the characters you
consider 'cruft', and b) that your definition of cruft really has
staying power (i.e. is objective, not subjective). Most speculation
about 'cleaning up' has failed to take these simple but fundamental
issues into consideration. Consequently, nothing has progressed past
idle chatter (this includes traffic on this list).
> Has there been any thought given from within Unicode (in the broadest sense)
> about this?
Several things *are* possible (without in any way violating the
fundamental role of character encodings as master keys).
1) Agreement might be reached on a *subset* of characters recommended
for use in new documents, with the subset excluding anything known to
be required only for the accurate representation of historical
documents.
2) A 'cleanup' mapping might be agreed upon that could be used to clean
up data as it comes into an editing process (e.g. by cut&paste). The
mapping would reflect the best knowledge as to what constitutes a
semantically neutral transformation that avoids the use of characters
not in the list from item 1 (see the sketch after this list).
3) The *presentation* of the list from item 1 could be improved such
that users can select the proper character to use without having to be
exposed to the (semantically irrelevant and often arbitrary) arrangement
of the Unicode code space.
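To make items 1 and 2 a bit more tangible, here is a rough Python
sketch. The two mapping entries and the subset check are purely my own
illustrations (both entries happen to coincide with existing NFKC
mappings); they are not anything such an agreement would actually
contain:

  # Hypothetical stand-ins for the agreed artifacts: a cleanup mapping
  # (item 2) and a check against the recommended subset (item 1).
  CLEANUP = {
      '\u00B5': '\u03BC',  # MICRO SIGN -> GREEK SMALL LETTER MU
      '\u212B': '\u00C5',  # ANGSTROM SIGN -> LATIN CAPITAL A WITH RING ABOVE
  }

  def clean_incoming(text):
      # Apply the semantically neutral mapping to text entering an edit
      # (e.g. a paste), leaving everything else untouched.
      return ''.join(CLEANUP.get(ch, ch) for ch in text)

  def uses_only_recommended(text, recommended):
      # Check a document against the recommended subset from item 1.
      return all(ch in recommended for ch in text)

  print(clean_incoming('5 \u00B5m'))  # -> '5 μm' (U+03BC instead of U+00B5)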
I don't expect that any of these three developments will necessarily
result in a universally agreed approach that covers the entire
repertoire of Unicode. I rather expect that progress will be made
piecemeal on these issues. I further suspect that what users find most
helpful might be the third one in the list, but who knows.