From: Theo (delete@elfdata.com)
Date: Fri Dec 03 2004 - 11:36:28 CST
> From: Asmus Freytag <asmusf@ix.netcom.com>
>> I use ... and UTF-32 for most internal processing that I write
>> myself. Let people say UTF-32 is wasteful if they want; I don't tend
>> to store huge amounts of text in memory at once, so the overhead is
>> much less important than one code unit per character.
>
>
> For performance-critical applications on the other hand, you need to
> use whichever UTF gives you the correct balance in speed and average
> storage size for your data.
>
> If you have very large amounts of data, you'll be sensitive to cache
> overruns. Enough so, that UTF-32 may be disqualified from the start.
> I have encountered systems for which that was true.
For both of these, I'd recommend UTF-8. It's compact, especially when
parsing source code (which is mostly ASCII even if it contains other
languages), and it's fast to process. Just use the byte processing
functions.
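To make the "byte processing functions" point concrete, here is a minimal sketch in Python (picked only for brevity; the sample text and the helper name find_utf8 are made up for illustration). It compares the size of a mostly-ASCII line in UTF-8 and UTF-32, then finds a non-ASCII word with a plain byte search. That works because in well-formed UTF-8 the encoding of one character never appears inside the encoding of another, so a byte match on a well-formed needle always lands on a character boundary.

```python
# Illustrative sketch: the sample text and all names here are assumptions.
source = 'name = "Zoë"  # naïve example\n'

utf8 = source.encode("utf-8")
utf32 = source.encode("utf-32-le")  # little-endian, no BOM: 4 bytes per code point


def find_utf8(haystack: bytes, needle: str) -> int:
    """Return the byte offset of needle in UTF-8 haystack, or -1 if absent.

    Uses an ordinary byte search; nothing is decoded.
    """
    return haystack.find(needle.encode("utf-8"))


offset = find_utf8(utf8, "naïve")
print(len(utf8), len(utf32), offset)
```

The offset that comes back is a byte offset, not a character index, which is usually all a parser or scanner needs.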
I've done natural-language word processing functions on UTF-8 as well,
and it's damn fast even there, despite being case insensitive!
My test was "word counting" to get word frequencies, done entirely in
UTF-8. Every string was entered as UTF-8 into a special "scanner"
which I invented, once in uppercase and once in lowercase, so the
scanner held both case variants of each UTF-8 string.
The scanner itself, however, only does byte-wise (case-sensitive)
searching. So, despite the counting being case insensitive on UTF-8, it
was totally blazingly fast. (One person reported counting words at
1 MB/second of pure text, from within a mixed Basic / C environment.)
Keep in mind that the counter must search through thousands of words
(every single word it's come across in the text) on every single word
lookup.
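Roughly, the idea can be sketched like this in Python. This is a guess at the shape of the thing, not the actual scanner: a plain dict stands in for whatever lookup structure is really used, the class and method names are invented for the example, and treating every non-ASCII byte as part of a word is a simplifying assumption (it also assumes the input is well-formed UTF-8). The point it shows is that case conversion happens only when a new spelling is first registered; every later lookup is an exact byte match.

```python
import re
from collections import Counter

# A "word" is a run of ASCII letters and/or non-ASCII bytes (>= 0x80).
# Simplifying assumption: non-ASCII punctuation would also count as word bytes.
WORD = re.compile(rb"[A-Za-z\x80-\xff]+")


class ByteWordCounter:
    """Hypothetical stand-in for the scanner: byte-exact lookups only."""

    def __init__(self) -> None:
        self.variants = {}       # exact UTF-8 bytes -> canonical (lowercase) bytes
        self.counts = Counter()  # canonical bytes -> number of occurrences

    def register(self, word: str) -> bytes:
        """Enter the case variants of word; return its canonical key."""
        canonical = word.lower().encode("utf-8")
        # Lowercase and uppercase as described; capitalized is an added guess.
        for form in (word.lower(), word.upper(), word.capitalize()):
            self.variants[form.encode("utf-8")] = canonical
        return canonical

    def count(self, text: bytes) -> None:
        for token in WORD.findall(text):
            canonical = self.variants.get(token)  # case-sensitive byte lookup
            if canonical is None:
                # First sighting of this spelling: the only place text is decoded.
                canonical = self.register(token.decode("utf-8"))
                self.variants[token] = canonical  # covers odd mixed-case spellings
            self.counts[canonical] += 1


counter = ByteWordCounter()
counter.count("Café café CAFÉ the The tea".encode("utf-8"))
print(counter.counts.most_common())
```

The variant table ends up two or three times larger than a table of distinct words, which is the price paid for never case-folding on the hot path.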
Anyhow, from my experience, UTF-8 is great for speed and RAM.