Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jun 04 2006 - 09:28:58 CDT


    From: "Hans Aberg" <haberg@math.su.se>
    > On 4 Jun 2006, at 03:53, Asmus Freytag wrote:
    >
    >> UTF-32 loses on all counts: it's so space inefficient that for
    >> large scale text processing it's swamped by cache misses,
    >
    > What do you have in your mind here?

    You have simply forgotten to consider what "cache misses" are: an important issue for local processing. This is a matter of computer technology, and it applies even though a modern processor can address a very large memory space.

    When handling a very large volume of text data in memory (for example in a word processor or a text editor performing repetitive transformations such as automatic search/replace, or in a web server handling large volumes of page scripts, such as a MediaWiki server like Wikipedia), compressing the data in memory gives a very significant improvement in performance, simply because of reduced page swaps and increased memory page hits.

    But the most important improvement comes with networking, because of bandwidth constraints, notably on the server side: the processing power is generally more than enough to saturate the available bandwidth, so the server is limited by its (costly) bandwidth.

    In that case, using UTF-32 simply wastes space and decreases performance. A workaround is to apply downstream compression to the pages output by the servers (using a stream compressor on the server output, independently of the stream), but this does not reduce the number of pages read from the local database when loading scripts, and it does not help reduce page loads into the processor cache.
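
    To make the space argument concrete, here is a minimal Python sketch that compares the encoded size of the same text in the three encoding forms. The sample strings and the helper name encoded_sizes are only illustrative assumptions, not taken from any real workload.

```python
# Illustrative samples only (hypothetical, not real server data).
samples = {
    "ascii": "Hello, world!",
    "latin": "Cr\u00e8me br\u00fbl\u00e9e",   # precomposed accented letters
    "cjk": "\u65e5\u672c\u8a9e",               # three CJK ideographs
}


def encoded_sizes(text):
    """Return the byte length of text in each encoding form, without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

    For text in the Basic Multilingual Plane, UTF-32 is always the largest of the three; only for supplementary characters does UTF-8 reach the same 4 bytes per codepoint.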

    So for internal processing, even UTF-8 will give a significant increase in performance, even if it has to be decoded to get the codepoints when implementing Unicode normalizations. The normalizations need not be performed through a conversion buffer: all that is required is a stream converter class that interprets the byte sequences and provides the codepoints on demand, without allocating more buffers to convert complete texts.
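
    A stream converter of that kind could look roughly like the following Python sketch. It relies on the standard codecs incremental decoder; the function name iter_code_points is an assumption made for illustration. Only one chunk at a time is decoded, so the complete text is never converted into a second buffer.

```python
import codecs


def iter_code_points(chunks):
    """Yield codepoints one at a time from an iterable of UTF-8 byte chunks.

    Chunks may split a multi-byte sequence anywhere: the incremental
    decoder keeps the pending bytes until the sequence is complete.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        for ch in decoder.decode(chunk):
            yield ord(ch)
    # Flush the decoder; this raises UnicodeDecodeError if the stream
    # ended in the middle of a multi-byte sequence.
    for ch in decoder.decode(b"", final=True):
        yield ord(ch)


# Usage: the chunk boundaries deliberately cut through the sequences.
chunks = [b"\xc3", b"\xa9\xe2\x82", b"\xac\xf0\x9d\x84\x9e"]
print([hex(cp) for cp in iter_code_points(chunks)])
```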

    Another thing to consider is that a Unicode-compliant algorithm need not be implemented by decoding code points from a UTF-encoded stream. You can of course implement it directly on the equivalent UTF byte sequences (for simplicity, the standard describes its algorithms only in terms of codepoints, independently of the UTF actually used to represent the text).
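
    As a small illustration of working on the byte sequences themselves, the Python sketch below counts codepoints and searches for a substring without decoding anything. Both function names are hypothetical, and both functions assume their inputs are well-formed UTF-8.

```python
def count_code_points(utf8_bytes):
    """Count codepoints in well-formed UTF-8 without decoding it.

    Every codepoint has exactly one lead byte; continuation bytes all
    have the bit pattern 10xxxxxx, so they are simply skipped.
    """
    return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)


def find_code_point_index(haystack, needle):
    """Return the codepoint index of the first match of needle, or -1.

    A plain byte search is enough: a lead byte can never be mistaken
    for a continuation byte, so a match of a well-formed needle always
    starts on a codepoint boundary.
    """
    pos = haystack.find(needle)
    if pos < 0:
        return -1
    return count_code_points(haystack[:pos])


text = "na\u00efve caf\u00e9".encode("utf-8")
print(count_code_points(text))
print(find_code_point_index(text, "caf\u00e9".encode("utf-8")))
```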

    So you can perfectly well implement the standard NF(K)C/D normalizations based only on UTF-8 byte sequences, in a fully conforming way. All that is needed is to preserve the identity of the codepoints independently of their representation (as a UTF, or BOCU-1 or SCSU, or even using GB18030 as the base encoding, or even ISO 8859-1 if it contains all the necessary characters mapped to the equivalent Unicode codepoints!).
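
    The Python sketch below shows the idea for one step only: canonical decomposition driven by a table keyed on UTF-8 byte sequences. It is a toy under stated assumptions, not a complete normalizer: the table covers only U+00C0..U+017F, the input must be well-formed UTF-8, and canonical reordering of combining marks and Hangul handling are left out. The names DECOMP, seq_len and decompose_utf8 are hypothetical.

```python
import unicodedata

# Toy table: UTF-8 bytes of a precomposed character mapped to the UTF-8
# bytes of its canonical decomposition. Built once from Python's Unicode
# database, for U+00C0..U+017F only (an assumption to keep the table small).
DECOMP = {}
for cp in range(0x00C0, 0x0180):
    ch = chr(cp)
    nfd = unicodedata.normalize("NFD", ch)
    if nfd != ch:
        DECOMP[ch.encode("utf-8")] = nfd.encode("utf-8")


def seq_len(lead):
    """Length of a well-formed UTF-8 sequence, given its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def decompose_utf8(data):
    """Apply the decomposition table to UTF-8 bytes, never decoding them."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        seq = data[i:i + n]
        # The byte sequence itself identifies the codepoint.
        out += DECOMP.get(seq, seq)
        i += n
    return bytes(out)


print(decompose_utf8("caf\u00e9".encode("utf-8")))
```

    The codepoint identity is preserved throughout: each byte sequence stands for exactly one codepoint, so the table lookup is equivalent to the codepoint-based mapping described in the standard.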


