Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Hans Aberg (haberg@math.su.se)
Date: Sun Jun 04 2006 - 11:50:45 CDT

  • Next message: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"

    On 4 Jun 2006, at 16:28, Philippe Verdy wrote:

    >>> UTF-32 loses on all counts: it's so space inefficient that for
    >>> large scale text processing it's swamped by cache misses,
    >>
    >> What do you have in your mind here?
    >
    > You have simply forgotten to think about what "cache misses" are: an
    > important issue for local processing. This is related to computer
    > technology, even if a modern processor can manage a very large
    > memory space.
    >
    > When handling a very large volume of text data in memory (for
    > example in a word processor or a text editor, and then performing
    > repetitive transformations like automatic search/replace, or in a
    > web server handling large volumes of page scripts, for example in a
    > MediaWiki server like Wikipedia), compressing the data in memory
    > gives a very significant improvement in performance, simply because
    > of reduced page swaps and increased memory page hits.

    For efficient virtual memory handling, the active pages must be kept
    in RAM, or else there is a slowdown by a factor of one hundred or so,
    roughly the ratio between the RAM and hard disk bus speeds.

    It is quite common for personal computers to have too little RAM,
    causing this problem not only in text processing, but in any running
    program.

    > In that case, using UTF-32 just always wastes space, and decreases
    > performance.

    So this is only true if one has too little RAM.

    > So for internal processing, even UTF-8 will give a significant
    > increase in performance, even if it requires decoding to get the
    > code points when implementing Unicode normalizations: the
    > normalizations need not be performed using a conversion buffer; it
    > just requires a stream converter class that interprets the byte
    > sequences and provides the code points on demand, without allocating
    > more buffers for converting complete texts.
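    Such a stream converter can be sketched as follows (a hypothetical
    minimal decoder, not any particular library's API; it checks only the
    most basic malformations):

```python
def codepoints(byte_stream):
    """Yield Unicode code points from an iterable of UTF-8 bytes,
    decoding on demand instead of converting the whole text first."""
    it = iter(byte_stream)
    for lead in it:
        if lead < 0x80:                    # one byte: ASCII
            yield lead
            continue
        elif 0xC2 <= lead < 0xE0:          # two-byte sequence
            n, cp = 1, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:          # three-byte sequence
            n, cp = 2, lead & 0x0F
        elif 0xF0 <= lead < 0xF5:          # four-byte sequence
            n, cp = 3, lead & 0x07
        else:
            raise ValueError("invalid UTF-8 lead byte: %#x" % lead)
        for _ in range(n):                 # consume continuation bytes
            cont = next(it)
            if cont & 0xC0 != 0x80:
                raise ValueError("malformed UTF-8 continuation byte")
            cp = (cp << 6) | (cont & 0x3F)
        yield cp
```

    Because it is a generator, a normalization pass can pull code points
    from it one at a time, with no second buffer holding the decoded text.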

    And here Moore's law comes into play again, as RAM becomes
    increasingly cheap.

    > But the most important improvement comes with networking, due to
    > bandwidth constraints, notably on the server side, because the
    > processing power is generally more than enough to saturate the whole
    > bandwidth, but the server is limited by its (costly) bandwidth.

    And this seems to be the data compression issue, in which case it
    might be prudent to find better algorithms for just that job.
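    Dedicated Unicode compression schemes do exist (e.g. SCSU, the
    Standard Compression Scheme for Unicode), but even a general-purpose
    compressor from the standard library already recovers most of the
    space UTF-32 wastes, as this illustrative sketch shows (the sample
    text is invented, and exact sizes depend on the text):

```python
import zlib

# Illustrative sample text; real page scripts would be larger and less regular.
text = "Unicode text for a web page. " * 200

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-be")

print(len(utf8), len(utf32))        # UTF-32 is four bytes per code point
print(len(zlib.compress(utf8)),
      len(zlib.compress(utf32)))    # compressed sizes
```

    For ASCII-heavy text like this, raw UTF-32 is four times the size of
    raw UTF-8, while the two compressed streams end up far closer in size.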

       Hans Aberg



    This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 11:56:02 CDT