From: Asmus Freytag (email@example.com)
Date: Sat Jun 03 2006 - 20:53:28 CDT
On 6/2/2006 4:24 PM, Kenneth Whistler wrote:
> Theodore Smith said:
>> and do it CPU efficiently and far more space efficiently.
For source data, such as HTML, which contain ASCII-only syntax elements
with high frequency, UTF-8 is indeed almost always the most space
efficient encoding. This is particularly true for the wasteful HTML or
XML generated by some converters.
With that assumption in mind, any task that involves parsing such data
(which is really a task of parsing ASCII) is indeed best handled in
UTF-8. The space, bandwidth and cache savings will all add up. In
addition, such data is intended for interchange, and UTF-8 is in some
ways more robust there due to its lack of endian issues.
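To make the space argument concrete, here is a minimal sketch that measures one markup-heavy fragment in both encoding forms. The sample string and the function name encoded_sizes are made up for illustration; real savings depend entirely on the ratio of markup to text in your own data.

```python
# Hypothetical sample: mostly ASCII markup around a short Greek word.
# The word is written with escapes so the code points are unambiguous.
GREEK_WORD = "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"
SAMPLE = '<div class="note"><p lang="el">' + GREEK_WORD + "</p></div>"


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_bytes = len(text.encode("utf-16-le"))
    return utf8_bytes, utf16_bytes


if __name__ == "__main__":
    utf8_bytes, utf16_bytes = encoded_sizes(SAMPLE)
    print("code points:", len(SAMPLE))
    print("UTF-8 bytes:", utf8_bytes)
    print("UTF-16 bytes:", utf16_bytes)
```

Every ASCII syntax character costs one byte in UTF-8 and two in UTF-16, so the markup dominates the total here even though the text itself is not ASCII.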
All nice advantages. On the other hand, the minute you do text
processing on the actual text data, such as morphological analysis, case
transformation, linguistically aware search, etc. you will need to
perform an implicit conversion to integral character values in order to
get at their properties, which you will need to drive your algorithm.
Given that, and given that the text portions (on average over all
languages) require more than 1, and probably around 2 bytes / character
in UTF-8, the savings over processing data that is already UTF-16 are
much smaller, or even nonexistent. (English-only data is a very obvious
exception - but this discussion is intended to address the average case).
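The bytes / character figure is easy to check for individual scripts. The sketch below uses a few short sample words chosen only for illustration (they are not a corpus, and they say nothing about the average over all languages); bytes_per_char is a hypothetical helper name, and "character" here means code point.

```python
def bytes_per_char(text, encoding):
    """Average encoded bytes per code point of text in the given encoding."""
    return len(text.encode(encoding)) / len(text)


# Illustrative samples only; all are in the BMP.
SAMPLES = {
    "English": "hello",
    "Greek": "Καλημέρα",
    "Hindi": "नमस्ते",
    "Japanese": "日本語",
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        utf8 = bytes_per_char(text, "utf-8")
        utf16 = bytes_per_char(text, "utf-16-le")
        print(f"{name}: UTF-8 {utf8:.1f}, UTF-16 {utf16:.1f} bytes/char")
```

For pure text in these scripts UTF-8 only wins outright on the English sample; it ties for Greek and loses for the Hindi and Japanese samples.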
Therefore, if you have UTF-16 data, it never makes sense to convert it
to UTF-8 for processing, but the other way around may be attractive.
There are techniques that allow property lookup for UTF-16 in a way that
sharply reduces the penalty for the (statistically rare) surrogate
pairs, so that UTF-16 is both space and CPU efficient for tasks that
need to be aware of character boundaries.
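One way to picture such a technique, as a rough sketch and not necessarily the scheme meant above: treat every non-surrogate code unit as its own code point and look it up directly, and only fall into a slower branch when a surrogate shows up. Here Python's unicodedata.category stands in for a real property table, and the helper names to_utf16_units and categories are assumptions made for this example.

```python
import struct
import unicodedata


def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def categories(units):
    """General category for each character in a UTF-16 code unit sequence."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if u < 0xD800 or u > 0xDFFF:
            # Fast path: the code unit is the code point (the common case).
            out.append(unicodedata.category(chr(u)))
            i += 1
        elif u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Slow path: combine a lead/trail surrogate pair.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(unicodedata.category(chr(cp)))
            i += 2
        else:
            # Unpaired surrogate: report it as category Cs and move on.
            out.append("Cs")
            i += 1
    return out


if __name__ == "__main__":
    # 'a', GREEK CAPITAL LETTER OMEGA, CUNEIFORM SIGN A (off the BMP)
    print(categories(to_utf16_units("a\u03a9\U00012000")))
```

The point of the layout is that the surrogate test is a single range check on the hot path, so BMP-only text pays almost nothing for supplementary-character support.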
You correctly note that character cluster boundaries (or generically
text element boundaries) affect text processing in all encoding forms,
therefore they drop out in the comparison.
UTF-32 loses on all counts: it's so space inefficient that for large
scale text processing it's swamped by cache misses, and the slight gain
in efficiency for accessing character property values matters only for
selected text corpora, such as cuneiform, that are entirely off the
BMP. Therefore, if you need to perform more than one operation on UTF-32
or hold large data in memory, it almost always pays to convert it to
some other encoding form - UTF-16 being the easier conversion.
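To show why the UTF-32 to UTF-16 direction is the easy one, here is a minimal sketch of the conversion: BMP code points copy through unchanged, and everything else becomes a surrogate pair by simple arithmetic. The function name utf32_to_utf16 and the choice to raise ValueError on invalid input are assumptions of this example.

```python
def utf32_to_utf16(code_points):
    """Convert a sequence of code points (ints) to UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            # Surrogates and out-of-range values are not valid in UTF-32.
            raise ValueError("invalid code point: %r" % (cp,))
        if cp < 0x10000:
            # BMP: one code unit, identical to the code point.
            units.append(cp)
        else:
            # Supplementary planes: split the 20-bit offset into two halves.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
    return units


if __name__ == "__main__":
    text = "A\U00012000"  # 'A' followed by CUNEIFORM SIGN A
    units = utf32_to_utf16([ord(ch) for ch in text])
    print([hex(u) for u in units])
    print("UTF-32 bytes:", 4 * len(text), "UTF-16 bytes:", 2 * len(units))
```

No table lookups or multi-branch length decoding are needed, which is what makes this cheaper than going through UTF-8.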
For any situation where large data volumes need to be scanned, with
minimal processing, it pays to work in the format the data happens to be
in, as conversion costs can't be recovered.
For (very) small data volumes, it does not matter, as small data sets
are cheap to begin with. Here it would be foolish to work in any
encoding form other than what your preferred library or OS happens to
support with a rich set of APIs. The relative costs of conversion may be
high, but the absolute costs are low and you won't recover the
investment of creating your own libraries.
For extensive processing on (very) large texts, particularly stored
texts, rather than streamed, the relative costs of necessary conversions
may be recoverable, and the choice of encoding form might enter into the
overall optimization of your system.
Like all optimizations, you would consider taking action only after
you've researched enough sample data to know where your bottlenecks are.
Therefore to write:
> Therefor, I win the discussion. Thank you :)
is not only premature, but exhibits a lack of understanding of the true