Re: Nicest UTF

From: Theo (delete@elfdata.com)
Date: Fri Dec 03 2004 - 11:36:28 CST

    > From: Asmus Freytag <asmusf@ix.netcom.com>

    >> I use ... and UTF-32 for most internal processing that I write
    >> myself. Let people say UTF-32 is wasteful if they want; I don't tend
    >> to store huge amounts of text in memory at once, so the overhead is
    >> much less important than one code unit per character.
    >
    >
    > For performance-critical applications on the other hand, you need to
    > use whichever UTF gives you the correct balance in speed and average
    > storage size for your data.
    >
    > If you have very large amounts of data, you'll be sensitive to cache
    > overruns. Enough so, that UTF-32 may be disqualified from the start.
    > I have encountered systems for which that was true.

    For both of these, I'd recommend UTF-8. It's compact, especially when
    parsing source code (which is mostly ASCII even if it contains other
    languages)... and it's fast to process. Just use the byte processing
    functions.
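
    A rough sketch of what byte processing on UTF-8 looks like, in Python
    purely for illustration. The sample text and variable names are made up
    for the example.

```python
# Illustrative only: the sample line and all names here are assumptions.
source = 'print("héllo wörld")  # größe = 10'

utf8 = source.encode("utf-8")
utf32 = source.encode("utf-32-le")  # no BOM, 4 bytes per code point

# Mostly-ASCII text costs about one byte per character in UTF-8.
sizes = (len(utf8), len(utf32))

# ASCII bytes never occur inside a multi-byte UTF-8 sequence, so plain
# byte searching and splitting cannot land in the middle of a character.
needle = "wörld".encode("utf-8")
offset = utf8.find(needle)   # a byte offset, not a character index
tokens = utf8.split(b" ")    # every piece is still valid UTF-8
```

    Note that the search result is a byte offset. That is fine for slicing
    the same byte buffer, but it is not a character count.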

    I've done natural language word processing functions on UTF-8 also, and
    it's damn fast even there, despite being case insensitive!

    My test was to do "word counting" to see word frequencies. I did this
    with UTF-8. All strings were entered as UTF-8 into a special "scanner"
    which I invented. Each string was entered in both uppercase and
    lowercase, so the scanner held both case variants of every string in
    UTF-8.

    The scanner, however, only does byte (case-sensitive) searching.
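
    A minimal sketch of that idea, in Python for illustration. This is not
    the actual scanner: the class name, the method names and the use of a
    hash table are all assumptions. It only shows how entering both case
    variants up front lets every lookup be an exact byte match.

```python
class ByteScanner:
    """Hypothetical stand-in for the scanner: exact byte matching only."""

    def __init__(self):
        self._table = {}  # UTF-8 bytes of a variant -> canonical word

    def add(self, word):
        canonical = word.lower()
        # Enter both case variants once, so lookups never have to fold case.
        for variant in (word.lower(), word.upper()):
            self._table[variant.encode("utf-8")] = canonical

    def lookup(self, raw):
        # Plain case-sensitive match on the raw bytes.
        return self._table.get(raw)


scanner = ByteScanner()
scanner.add("école")
scanner.add("hello")
```

    In this sketch only the all-lowercase and all-uppercase forms are
    entered, so a mixed-case form such as "Hello" would need its own entry.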

    So, despite being case insensitive on UTF-8, it was blastingly fast.
    (One person reported counting words at 1MB/second of pure text, from
    within a mixed Basic / C environment.) Keep in mind that on every
    single word lookup, the counter must search through thousands of words
    (every single word it has come across in the text so far).
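
    For illustration, a word counter along these lines could be sketched as
    below. It is an assumption-laden sketch, not the code that was
    measured: the function name, the punctuation list and the choice to
    register case variants the first time a word is seen are all made up
    for the example.

```python
def count_words(data):
    """Count word frequencies in UTF-8 bytes, ignoring case (sketch)."""
    variants = {}  # UTF-8 bytes of a case variant -> canonical word
    counts = {}    # canonical word -> frequency
    for token in data.split():               # splits on ASCII whitespace
        token = token.strip(b".,;:!?\"'()")  # trim ASCII punctuation
        if not token:
            continue
        canonical = variants.get(token)      # byte-for-byte lookup
        if canonical is None:
            # First sighting of this spelling: decode once and register
            # its case variants so later lookups stay byte-only.
            text = token.decode("utf-8")
            canonical = text.lower()
            for form in (text, text.lower(), text.upper()):
                variants[form.encode("utf-8")] = canonical
        counts[canonical] = counts.get(canonical, 0) + 1
    return counts
```

    Case conversion is paid only when a new spelling shows up. Every
    repeat of a known spelling is a single lookup on the raw bytes, however
    many words have been collected.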

    Anyhow, from my experience, UTF-8 is great for speed and RAM.


