Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Hans Aberg (haberg@math.su.se)
Date: Sun Jun 04 2006 - 07:21:44 CDT

    On 4 Jun 2006, at 03:53, Asmus Freytag wrote:

    > UTF-32 loses on all counts: it's so space inefficient that for
    > large scale text processing it's swamped by cache misses,

    What do you have in mind here?

    > and the slight gain in efficiency for accessing character property
    > values matters only for selected text corpora, such as cuneiform
    > etc, that are entirely off the BMP.

    This just says that for character sets confined to a particular
    region, an encoding optimized for that region is more efficient,
    though it will lose out in general use. It might be better to choose
    a more efficient optimization method than a particular legacy encoding.
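
    To make the space trade-off concrete, here is a minimal Python
    sketch that counts the bytes each encoding form needs for a few
    sample strings. The helper name encoded_sizes and the sample
    strings are my own illustrative assumptions, not anything from the
    thread; the encodings are taken without a byte order mark.

```python
# Compare encoded sizes (in bytes) of sample strings under UTF-8, UTF-16
# and UTF-32. The helper name and the samples are illustrative assumptions.

def encoded_sizes(text):
    """Return the byte length of text in each encoding form (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

samples = {
    "ascii": "hello",                     # 5 code points, all ASCII
    "greek": "\u03b1\u03b2\u03b3",        # 3 code points inside the BMP
    "cuneiform": "\U00012000\U00012001",  # 2 code points off the BMP
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

    For the ASCII sample UTF-32 needs four times the bytes of UTF-8,
    while for the cuneiform sample all three forms come out the same
    size, which is the "confined to a particular region" effect.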

    > Therefore, if you need to perform more than one operation on UTF-32
    > or hold large data in memory, it almost always pays to convert it
    > to some other encoding form - UTF-16 being the easier conversion.

    I am not sure what you have in mind here: With modern use of
    virtual memory, the OS emulates a large data space. For 32-bit
    computers, this is typically 2^31 bytes (or words), but these are now
    on the way out, in favor of 64-bit computers with an even larger
    address space.
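
    For reference, the UTF-32 to UTF-16 conversion mentioned in the
    quoted text can be sketched as follows. This is only a minimal
    sketch under my own assumptions: the function name utf32_to_utf16
    is hypothetical, the input is a list of integer code points, and
    the output is a list of 16-bit code units.

```python
# Minimal sketch: convert a sequence of code points (UTF-32 values) into
# UTF-16 code units. The function name is a hypothetical choice.

def utf32_to_utf16(code_points):
    """Return the UTF-16 code units for the given code points."""
    units = []
    for cp in code_points:
        # Surrogate values and anything beyond U+10FFFF are not valid
        # Unicode scalar values and cannot be encoded.
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point: %r" % (cp,))
        if cp < 0x10000:
            # BMP code points map to a single code unit unchanged.
            units.append(cp)
        else:
            # Supplementary code points become a surrogate pair:
            # the high 10 bits go into the lead, the low 10 into the trail.
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))
            units.append(0xDC00 | (v & 0x3FF))
    return units

print([hex(u) for u in utf32_to_utf16([0x41, 0x12000])])
```

    Only code points off the BMP need the extra arithmetic; everything
    else is copied through as a single unit.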

       Hans Aberg
