Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )

From: Asmus Freytag
Date: Sat Jun 03 2006 - 20:53:28 CDT


    On 6/2/2006 4:24 PM, Kenneth Whistler wrote:
    > Theodore Smith said:
    >> and do it CPU efficiently and far more space efficiently.
    For source data, such as HTML, which contain ASCII-only syntax elements
    at high frequency, UTF-8 is indeed almost always the most space-
    efficient encoding. This is particularly true for the wasteful HTML or
    XML generated by some converters.

    With that assumption in mind, any task that involves parsing such data
    (which is really a task of parsing ASCII) is indeed best handled in
    UTF-8. The space, bandwidth and cache savings will all add up. In
    addition, such data is intended for interchange, and UTF-8 is in some
    ways more robust there due to its lack of endian issues.
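    A small sketch (my illustration, not from the post) of why parsing
    ASCII syntax works directly on UTF-8 bytes: every byte of a multi-byte
    UTF-8 sequence has its high bit set, so a scan for an ASCII delimiter
    can never produce a false hit inside a multi-byte character.

    ```python
    # Sketch: ASCII delimiters can be located in raw UTF-8 bytes with no
    # decoding step, because continuation and lead bytes are all >= 0x80.
    def find_ascii_delim(data: bytes, delim: str) -> int:
        """Return the index of an ASCII delimiter in UTF-8 bytes, or -1."""
        assert len(delim) == 1 and ord(delim) < 0x80
        return data.find(delim.encode("ascii"))

    # '\u00e9' (e-acute) encodes as the two bytes 0xC3 0xA9; neither can
    # collide with '<' or '>'.
    html = "<p>\u00e9l\u00e9ment</p>".encode("utf-8")
    find_ascii_delim(html, ">")   # finds the tag close, never a byte of 'é'
    ```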

    All nice advantages. On the other hand, the minute you do text
    processing on the actual text data, such as morphological analysis, case
    transformation, linguistically aware search, etc., you will need to
    perform an implicit conversion to integral character values in order to
    get at their properties, which are needed to drive your algorithm.
    Given that, and given that the text portions (averaged over all
    languages) require more than 1, and probably around 2, bytes per
    character in UTF-8, the savings over processing data that is already
    UTF-16 are much smaller, or even non-existent. (English-only data is a
    very obvious exception, but this discussion is intended to address the
    average case.)
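    To make the two points above concrete (my example, with Greek chosen
    only as a representative non-Latin script): property-driven algorithms
    need the integral code point values, and for such text UTF-8 no longer
    undercuts UTF-16 in size.

    ```python
    # Sketch: Greek, Cyrillic, Hebrew, etc. take 2 bytes per character in
    # UTF-8 -- the same as UTF-16 -- and most CJK takes 3.
    text = "\u03b1\u03b2\u03b3"        # alpha, beta, gamma

    len(text.encode("utf-8"))          # 6 bytes: 2 per character
    len(text.encode("utf-16-le"))      # 6 bytes: no worse than UTF-8

    # The integral values a property lookup actually consumes:
    code_points = [ord(c) for c in text]
    ```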

    Therefore, if you have UTF-16 data, it never makes sense to convert it
    to UTF-8 for processing, but the other way around may be attractive.

    There are techniques that allow property lookup for UTF-16 in a way that
    sharply reduces the penalty for the (statistically rare) surrogate
    pairs, so that UTF-16 is both space and CPU efficient for tasks that
    need to be aware of character boundaries.
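    One such technique, sketched here as my own illustration (the post does
    not spell out a specific method): keep the per-unit loop on a fast path
    for BMP code units and take the surrogate branch only in the
    statistically rare supplementary-plane case.

    ```python
    def utf16_code_points(units):
        """Decode a sequence of UTF-16 code units into code points.

        Fast path: a non-surrogate unit IS the code point. The surrogate
        branch is taken only for supplementary-plane characters.
        (Error handling for unpaired surrogates is omitted for brevity.)
        """
        i, out = 0, []
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:          # rare: lead surrogate
                lo = units[i + 1]              # trail surrogate
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
            else:                              # common: one unit = one char
                out.append(u)
                i += 1
        return out

    # 'A' followed by U+10437 (Deseret long E), encoded as a surrogate pair:
    utf16_code_points([0x0041, 0xD801, 0xDC37])
    ```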

    You correctly note that character cluster boundaries (or generically
    text element boundaries) affect text processing in all encoding forms,
    therefore they drop out in the comparison.

    UTF-32 loses on all counts: it's so space-inefficient that for large-
    scale text processing it's swamped by cache misses, and the slight gain
    in efficiency for accessing character property values matters only for
    selected text corpora, such as cuneiform, etc., that are entirely off
    the BMP. Therefore, if you need to perform more than one operation on
    UTF-32 or hold large data in memory, it almost always pays to convert
    it to some other encoding form - UTF-16 being the easier conversion.
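    The space penalty is easy to quantify (my example): UTF-32 spends 4
    bytes on every code point regardless of content, so for ordinary
    BMP-dominated text it simply doubles the UTF-16 footprint.

    ```python
    # Sketch: byte footprint of the same short string in each encoding form.
    text = "Unicode \u00fcber alles"   # 18 characters, all in the BMP

    len(text.encode("utf-32-le"))      # 72 bytes: 4 per code point, always
    len(text.encode("utf-16-le"))      # 36 bytes: 2 per BMP character
    len(text.encode("utf-8"))          # 19 bytes: ASCII-heavy text wins here
    ```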

    For any situation where large data volumes need to be scanned, with
    minimal processing, it pays to work in the format the data happens to be
    in, as conversion costs can't be recovered.

    For (very) small data volumes, it does not matter, as small data sets
    are cheap to begin with. Here it would be foolish to work in any
    encoding form other than what your preferred library or OS happens to
    support with a rich set of APIs. The relative costs of conversion may be
    high, but the absolute costs are low and you won't recover the
    investment of creating your own libraries.

    For extensive processing on (very) large texts, particularly stored
    texts, rather than streamed, the relative costs of necessary conversions
    may be recoverable, and the choice of encoding form might enter into the
    overall optimization of your system.

    Like all optimizations, you would consider taking action only after
    you've researched enough sample data to know where your bottlenecks are.
    Therefore to write:

    > Therefor, I win the discussion. Thank you :)

    is not only premature, but exhibits a lack of understanding of the true


    This archive was generated by hypermail 2.1.5 : Sat Jun 03 2006 - 21:20:19 CDT