Re: General-purpose data structures, compression algorithms and the sorts and Unicode

From: Markus Scherer
Date: Fri Feb 04 2005 - 13:48:51 CST

    Hans Aberg wrote:
    > My guess is that Unicode, unless I have misunderstood Unicode, will be used
    > in an environment where those things appear. Then Unicode should help
    > interfacing those features, rather than cutting them off by trying to do too
    > much in those quarters. Such a technique is typical of say modern OO
    > development, and is also used, for example, when structuring modern science.

    I think the general complaint here, by just about everyone, is
    1. You make statements about Unicode without knowing anything about it.
    2. You fail to acknowledge that Unicode is widely used and implemented,
        and although some (or even many) things could have been designed
        better, they work.

    I bet that Unicode is the most widely used character set ever, except for US-ASCII and ISO 8859-1.

    > To take one example from your list, UTF-16 was mentioned as giving better
    > data compression over UTF-8/32. I then remarked that much better data
    > compression can be achieved using data compression algorithms, and that the
    > data compression achievable using UTF-16 was relatively minor relative to
    > that. ...

    Of course - if compression is all you are looking at.

    If you need an encoding for processing, where random access and fast and unambiguous decoding are
    important, use one of the three encoding forms that the Unicode Standard defines.
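
To make the three encoding forms concrete, here is a minimal Python 3 sketch that counts how many code units the same text needs in UTF-8, UTF-16 and UTF-32. The function name `code_unit_counts` and the sample string are illustrative assumptions, not anything defined by the Unicode Standard.

```python
def code_unit_counts(text):
    """Return the number of code units needed for text in each encoding form.

    UTF-8 uses 8-bit units, UTF-16 uses 16-bit units, UTF-32 uses 32-bit units.
    The "-le" codecs are used so that no byte order mark is prepended.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),            # 1 byte per code unit
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per code unit
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per code unit
    }


if __name__ == "__main__":
    # Hypothetical sample: ASCII letter, Latin-1 letter, CJK ideograph,
    # and one supplementary-plane character (U+1D11E).
    sample = "a\u00e9\u4e2d\U0001d11e"
    print(code_unit_counts(sample))
```

Only UTF-32 has exactly one code unit per code point. UTF-16 needs two units (a surrogate pair) for supplementary characters, and UTF-8 needs one to four units per code point.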

    If you need a byte serialization, use one of the encoding schemes associated with the above, or one
    of the following ones, or invent your own.

    If you want a measure of compression combined with fast text streaming, use something like SCSU
    (which is a sister standard to Unicode, not part of it directly) or BOCU-1 (which is defined in a
    Technical Note).

    If you want maximum compression and don't need to be able to process directly and can afford slow
    encoding and decoding, use zip/gzip/bzip2.
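
A rough way to see the trade-off is to measure the same text in each encoding form, raw and after a general-purpose compressor. The sketch below assumes Python 3 and its standard `gzip` and `bz2` modules; the function name `encoded_sizes` and the sample text are hypothetical, and the actual numbers depend heavily on the input.

```python
import bz2
import gzip


def encoded_sizes(text):
    """Return byte sizes of text per encoding: raw, gzip-compressed, bzip2-compressed."""
    sizes = {}
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(codec)
        sizes[codec] = {
            "raw": len(raw),
            "gzip": len(gzip.compress(raw)),
            "bz2": len(bz2.compress(raw)),
        }
    return sizes


if __name__ == "__main__":
    # Hypothetical, highly repetitive sample; real text compresses less well.
    sample = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac " * 200
    for codec, row in encoded_sizes(sample).items():
        print(codec, row)
```

The compressed output cannot be indexed or processed directly; it has to be decompressed back into one of the encoding forms first, which is the cost referred to above.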


    Opinions expressed here may not reflect my company's positions unless otherwise noted.
