Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Hans Aberg (haberg@math.su.se)
Date: Mon Jun 05 2006 - 05:05:57 CDT


    On 5 Jun 2006, at 11:33, Philippe Verdy wrote:

    > I don't presume which encoding is better for all.

    This is what I too said.

    > Application and networking tuning (and interoperability) is just as
    > important! UTF-8 is excellent for interoperability in heterogeneous
    > environments, and is supported by the largest number of
    > protocols. UTF-32 is not, and it wastes space except for local
    > handling of small quantities of text, which is otherwise
    > represented and stored or transmitted differently, only because
    > it's more convenient for interoperation.

    This is also what I said: UTF-32 may be favored internally in a
    program for the sake of alignment and speed. UTF-8 is fine for
    text-to-text communications.

    > But don't forget databases. They are stored on disks, and disk
    > access is always too slow. What you read from disk will end up in
    > memory and will swap to disk. If you can't handle the strings
    > natively in memory exactly the way they are stored, the swapping to
    > disk will require more disk space.

    There I said that if data compression is a major objective, do not
    rely on a character encoding to do the job, but seek out more
    efficient compression methods.
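
    To make that concrete, here is a minimal Python sketch. The sample
    text and the choice of zlib are my own assumptions for illustration,
    not anything measured in this thread. It compares the raw and
    compressed sizes of the same text in UTF-8 and in UTF-32.

```python
import zlib

# Hypothetical sample text; any repetitive natural-language text would do.
sample = "Unicode text can be stored in several encoding forms. " * 200


def sizes(text, encoding):
    """Return (raw_bytes, zlib_compressed_bytes) for text in the given encoding."""
    raw = text.encode(encoding)
    return len(raw), len(zlib.compress(raw, 9))


# "utf-32-le" is used so that no byte order mark is prepended.
utf8_raw, utf8_packed = sizes(sample, "utf-8")
utf32_raw, utf32_packed = sizes(sample, "utf-32-le")

print("UTF-8 :", utf8_raw, "->", utf8_packed)
print("UTF-32:", utf32_raw, "->", utf32_packed)
```

    For pure ASCII input like this, the raw UTF-32 form is exactly four
    times the size of the UTF-8 form, and a general-purpose compressor
    should shrink either one well below its raw size. How large the
    remaining gap is depends entirely on the data, so measure with your
    own.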

    > From: "Hans Aberg" <haberg@math.su.se>
    >> And here Moore's law comes into play again, as RAM becomes
    >> increasingly cheap.
    >
    > Moore's law has nothing to do with this. Even though RAM is getting
    > cheaper per megabyte, modern programs use more memory and handle
    > more data.

    My focus was narrower: I wanted to find out why the OP felt that
    "cache misses" excluded UTF-32 in favor of UTF-8. That, I think,
    is a problem only if you have too little RAM in your computer,
    and Moore's law says that enough will be available soon. If you
    have too little RAM on a virtual-memory-based computer, nothing
    will really help your programs run except getting enough RAM,
    as the faster parts of the computer will spend their time waiting
    for page swaps to occur.

    If you have enough RAM, UTF-32 should be faster than UTF-8
    internally in a program, as no alignments need to be computed. But
    only proper profiling of each given program can really tell.
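
    As a rough illustration of what "computing alignments" means here:
    to locate the n-th code point in UTF-8 you have to walk the bytes
    and skip continuation bytes, whereas in UTF-32 it sits at byte
    offset 4*n. The sketch below is in Python; the function names are
    hypothetical, and it assumes well-formed input with no byte order
    mark.

```python
def utf8_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in well-formed UTF-8."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


def utf32_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point in UTF-32 without a BOM: no scan needed."""
    if not 0 <= n < len(data) // 4:
        raise IndexError("code point index out of range")
    return 4 * n


# One code point each of 1, 2, 3 and 4 UTF-8 bytes, followed by ASCII "z".
text = "a\u00e9\u20ac\U0001D11Ez"
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-le")

print([utf8_offset_of(u8, i) for i in range(len(text))])    # [0, 1, 3, 6, 10]
print([utf32_offset_of(u32, i) for i in range(len(text))])  # [0, 4, 8, 12, 16]
```

    The UTF-8 lookup is linear in the length of the string and the
    UTF-32 one is constant, but whether that difference shows up in a
    real program is exactly the kind of thing only profiling will
    answer.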

       Hans Aberg


