From: Hans Aberg (email@example.com)
Date: Sun Jun 04 2006 - 11:50:45 CDT
On 4 Jun 2006, at 16:28, Philippe Verdy wrote:
>>> UTF-32 loses on all counts: it's so space inefficient that for
>>> large scale text processing it's swamped by cache misses,
>> What do you have in your mind here?
> You have simply forgotten to think about what "cache misses" are. An
> important issue for local processing. This is related to computer
> technology, even if a modern processor can manage very large memory.
> When handling very large volumes of text data in memory (for example
> in a word processor, or a text editor, and then performing
> repetitive transformations like automatic search/replace, or in a
> web server handling large volumes of page scripts, for example in a
> MediaWiki server like Wikipedia), compressing the data in memory
> gives a very significant improvement in performance, simply because
> of reduced page swaps and increased memory page hits.
For efficient virtual memory handling, the active pages must be kept
in RAM, or else there is a slowdown of a factor of one hundred or so,
the ratio between the RAM memory and hard disk bus speeds.
It is quite common for personal computers to have too little RAM,
causing this problem not only in text processing, but in any program.
> In that case, using UTF-32 just always wastes space, and decreases
> performance.
So this is only true if one has too little RAM.
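To put a number on the space argument: for ASCII-dominated text, UTF-32 occupies exactly four bytes per code point where UTF-8 needs only one. A quick Python sketch (the sample text is of course made up):

```python
# Compare the in-memory size of the same text in UTF-8 vs UTF-32.
text = "Hello, world! " * 1000  # ASCII-only sample text

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")  # little-endian, no BOM

# For pure ASCII, UTF-8 uses 1 byte per code point, UTF-32 always 4.
print(len(utf8), len(utf32))  # UTF-32 is 4x larger here
```

Four times the working set means, roughly, a quarter as much text fits in each cache line and each RAM page, which is where the cache-miss and page-swap costs come from.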
> So for internal processing, even UTF-8 will cause significant
> increase of performance, even if it requires decoding it to get the
> codepoints when implementing Unicode normalizations: the
> normalizations need not be performed using a conversion buffer, it
> just requires a stream converter class that interprets the byte
> sequences and provides the codepoints on demand, without allocating
> more buffers for data conversion of complete texts.
And here Moore's law comes into play again, as RAM becomes cheaper
over time.
> But the most important improvement comes with networking, due to
> bandwidth constraints, notably on the server side, because the
> processing power is generally more than enough to support the whole
> bandwidth, but the server is limited by its (costly) bandwidth.
And this seems to be the data compression issue, in which case it
might be prudent to find better algorithms for just that job.
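For instance, a general-purpose compressor such as zlib already does very well on repetitive page markup, regardless of whether the underlying encoding is UTF-8; a minimal sketch, with made-up sample data:

```python
import zlib

# Repetitive markup of the kind a web server sends out.
page = ("<p>Some repetitive page markup</p>\n" * 500).encode("utf-8")

compressed = zlib.compress(page, level=6)

# On highly repetitive input the saving dwarfs anything an
# encoding choice alone could achieve.
print(len(page), len(compressed))
```

The point being that on the wire, transfer compression and encoding choice are largely independent levers.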
This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 11:56:02 CDT