Re: Codespace Anxiety Redux

From: Asmus Freytag
Date: Fri Nov 02 2007 - 03:28:59 CST


    On 11/2/2007 1:48 AM, Jeroen Ruigrok van der Werven wrote:
    > Kenneth,
    > -On [20071101 22:46], Kenneth Whistler wrote:
    >> Once again, just in time for the holidays, the Unicode list
    >> has come around again to one of its perennial favorite topics:
    >> how 17 planes isn't enough codespace, how software will
    >> break when we "inevitably" run out of codes for characters,
    >> and what a shame it is to be stuck with such a limited
    >> and architecturally flawed construct, given all the 30 bezillion
    >> unencoded characters waiting to be encoded.
    > One thing I am interested in though is the following:
    > we know that in the current implementation of Unicode we have a lot of cruft
    > resulting from wrong data, typos and whatnot that never gets removed due to
    > the nature/charter of Unicode.
    They don't ever get removed because of the fundamental nature of
    character encoding.

    A character encoding is your key to interpret data encoded using that
    encoding. Or, in the case of a large character encoding like Unicode,
    it's perhaps better to think of it as a whole humongous ring of master keys.
    > Ultimately the desire will arise to take what we know is (mostly) correct in
    > Unicode, clear all the unwanted cruft and start fresh from that point.
    I'm not so sure. By throwing away some of your master keys, you make
    some data (permanently) undecipherable. You'd have to be extremely sure
    that a) no data that you are interested in contain the characters you
    consider 'cruft', and b) that your definition of cruft really has
    staying power (i.e. is objective, not subjective). Most speculation
    about 'cleaning up' has failed to take these simple, but fundamental
    issues into consideration. Consequently, nothing has progressed past
    idle chatter (this includes traffic on this list).
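
    To make the master-key point concrete, here is a minimal sketch in
    Python. The PRUNED set and the interpret() helper are hypothetical,
    chosen only for illustration; nobody has proposed removing these
    particular characters. It shows that once a "key" is thrown away,
    existing data that uses it can no longer be interpreted.

```python
import unicodedata

# Assumption: an illustrative list of code points somebody might call
# 'cruft'. This is not a real proposal.
PRUNED = frozenset({0x212A, 0x212B})  # KELVIN SIGN, ANGSTROM SIGN


def interpret(text, pruned=frozenset()):
    """Return the character name for each code point in text.

    Code points in `pruned` yield None: their key has been thrown away,
    so the data at that position cannot be interpreted any more.
    """
    return [
        None if ord(ch) in pruned else unicodedata.name(ch, None)
        for ch in text
    ]


# An existing document that happens to use one of the 'cruft' characters.
old_document = "300\u212A"

full = interpret(old_document)                   # every key still on the ring
after_cleanup = interpret(old_document, PRUNED)  # some keys discarded
```

    With the full ring of keys every position has a name; with the
    pruned ring the last position comes back as None, and nothing in the
    data itself says what used to be there.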
    > Has there been any thought given from within Unicode (in the broadest sense)
    > about this?
    Several things *are* possible without in any way violating the
    fundamental role of character encodings as master keys.

    1) Agreement might be reached on a *subset* of characters recommended
    for use in new documents, with the subset excluding anything that has
    become known to be not required for anything but the accurate
    representation of historical documents.

    2) A 'cleanup' mapping might be agreed upon, that can be used to clean
    up data that comes into an editing process (e.g. by cut&paste). The
    mapping would reflect the best knowledge as to what is a semantically
    neutral transformation that avoids the use of characters not in the list
    from item 1.

    3) The *presentation* of the list from item 1 could be improved such
    that users can select the proper character to use without having to be
    exposed to the (semantically irrelevant and often arbitrary) arrangement
    of the Unicode code space.
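
    Items 1 and 2 could be sketched roughly as follows, in Python. The
    CLEANUP_MAP table and all function names are assumptions made up for
    this illustration, not an agreed list. The three entries are
    characters that already have singleton canonical decompositions in
    Unicode, so replacing them is about as semantically neutral as a
    transformation can get.

```python
import unicodedata

# Hypothetical cleanup mapping (item 2). Each key is a character kept in
# Unicode for compatibility with older data; each value is its canonical
# equivalent, which a 'recommended subset' (item 1) would retain.
CLEANUP_MAP = {
    "\u212A": "K",       # KELVIN SIGN   -> LATIN CAPITAL LETTER K
    "\u212B": "\u00C5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u2126": "\u03A9",  # OHM SIGN      -> GREEK CAPITAL LETTER OMEGA
}

# Build the translation table once; str.translate() applies it per character.
_TABLE = str.maketrans(CLEANUP_MAP)


def in_recommended_subset(ch):
    """Item 1 (sketch): a character is recommended unless the map replaces it."""
    return ch not in CLEANUP_MAP


def clean_up(text):
    """Item 2 (sketch): apply the mapping to incoming text, e.g. on paste."""
    return text.translate(_TABLE)


def flag_legacy(text):
    """List (index, name) for every character outside the recommended subset."""
    return [
        (i, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if not in_recommended_subset(ch)
    ]
```

    Note that nothing is removed from the encoding itself: old data
    containing U+2126 remains fully interpretable, and the mapping only
    affects text on its way into a new document.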

    I don't expect that any of these three developments will necessarily
    result in a universally agreed approach that covers the entire
    repertoire of Unicode. I rather expect that progress will be made
    piecemeal on these issues. I further suspect that what users find most
    helpful might be the third one in the list, but who knows.

