RE: Unicode forms for internal storage

From: Elliotte Rusty Harold
Date: Tue Jan 20 2004 - 14:41:13 EST

    At 10:26 AM -0800 1/20/04, Mike Ayers wrote:

             BZZZT! Sorry, thanks for playing. You can't get the
    advantages of both with no drawbacks. Given the octets 0x5B5B, how
    would you know if you had "[[" or a Chinese character?
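To make the ambiguity in that question concrete, here is a minimal Python sketch. It assumes big-endian byte order for the UTF-16 reading; the variable names are only illustrative.

```python
# The same two octets, read under two different encodings.
raw = bytes([0x5B, 0x5B])

as_utf8 = raw.decode("utf-8")       # two ASCII characters
as_utf16 = raw.decode("utf-16-be")  # one BMP character, U+5B5B

print(repr(as_utf8))        # '[['
print(hex(ord(as_utf16)))   # 0x5b5b
```

Without some out-of-band or in-band marker, a decoder has no way to choose between the two readings.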

    Actually, it looks like SCSU may do exactly that. If I'm
    understanding the algorithms, it actually encodes most BMP characters
    in a single byte, compressing quite a bit better than my naive idea
    to switch between UTF-8 and UTF-16.

    All schemes I've seen do involve some sort of flag characters in the
    data stream to switch between different code ranges. As long as you
    can keep the number of flag characters added down below the savings,
    you're good to go. My original idea was to simply use a null to
    switch between ASCII and UTF-16. SCSU looks a lot more sophisticated.
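A rough sketch of what the null-switch idea might look like, in Python. Everything here is an assumption for illustration rather than a worked-out design: the stream starts in ASCII mode, a single 0x00 byte switches to big-endian UTF-16, a 0x0000 code unit switches back, U+0000 is not allowed in the text, and the hypothetical MIN_ASCII_RUN threshold decides when switching back is worth the flag bytes.

```python
ASCII, WIDE = 0, 1
MIN_ASCII_RUN = 4  # hypothetical threshold: shorter ASCII runs stay in UTF-16


def _ascii_run(text, i):
    """Length of the run of non-null ASCII characters starting at index i."""
    j = i
    while j < len(text) and 0 < ord(text[j]) < 0x80:
        j += 1
    return j - i


def encode(text):
    if "\x00" in text:
        raise ValueError("U+0000 is reserved as the mode switch")
    out = bytearray()
    mode = ASCII
    i = 0
    while i < len(text):
        run = _ascii_run(text, i)
        if mode == ASCII:
            if run:
                out += text[i:i + run].encode("ascii")  # one byte per character
                i += run
            else:
                out.append(0)          # single null byte: switch to UTF-16
                mode = WIDE
        elif run >= MIN_ASCII_RUN:
            out += b"\x00\x00"         # null code unit: switch back to ASCII
            mode = ASCII
        else:
            out += text[i].encode("utf-16-be")
            i += 1
    return bytes(out)


def decode(data):
    out = []
    wide = bytearray()  # pending UTF-16 bytes, decoded as one run
    mode = ASCII
    i = 0
    while i < len(data):
        if mode == ASCII:
            if data[i] == 0:
                mode = WIDE
            else:
                out.append(chr(data[i]))
            i += 1
        else:
            unit = data[i:i + 2]
            if unit == b"\x00\x00":
                out.append(wide.decode("utf-16-be"))
                wide.clear()
                mode = ASCII
            else:
                wide += unit
            i += 2
    out.append(wide.decode("utf-16-be"))
    return "".join(out)
```

Pure ASCII text costs one byte per character, and text with no ASCII runs costs one flag byte more than plain UTF-16. Mixed text pays one to two bytes per switch, which is the "flag characters versus savings" trade-off described above.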

    Of course, neither of those schemes will compress truly random data,
    but most data isn't random.

    > However, I would like the translation into and out of this format to
    > be at least as fast as the translation between UTF-8 and UTF-16 the
    > class is currently performing on every call to setValue and getValue,
    > ideally faster.

             Hmmm - again, this may be asking for too much. The
    UTF-8/UTF-16 transform is pretty simple. Is it bogging you down?

    It is a noticeable point in my profiling. I really did have to make a
    choice between speed and space here. According to it looks like SCSU is
    faster for a lot of languages but 10-25% slower for English, French
    and Japanese than the UTF-8/UTF-16 conversion.

             If your application will use much more of European or
    non-European languages, then just use UTF-8 or UTF-16 respectively,
    as you won't really lose much space that way.

    This is a class library which is relatively language neutral. If a
    Chinese programmer uses it, I'd expect they'd have a lot of data in
    Chinese. So far most of the adoption that I know about is in the
    Americas and Europe, but there's no reason it has to stay that way,
    especially if I can reduce the footprint for CJK text.

    If space usage is random/indeterminate/evenly distributed, then,
    assuming that any given string is primarily in a single language, a
    TLV type discriminating between UTF-8 and UTF-16 should do nicely.
    Precede each string with an OR of the MSB (0 for UTF-8, 1 for UTF-16)
    and the length, in octets, of the string (therefore max of 32,767
    octets per string, which shouldn't ordinarily be a problem).
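For concreteness, the header described above could be sketched in Python roughly as follows. The names pack_string and unpack_string are hypothetical, and the big-endian byte order and the "pick whichever encoding is shorter" rule are assumptions, not part of the proposal.

```python
import struct

UTF16_FLAG = 0x8000   # MSB set: payload is UTF-16; clear: payload is UTF-8
MAX_LEN = 0x7FFF      # 15 bits of length, so at most 32,767 octets


def pack_string(text):
    """Prefix the shorter of the two encodings with a 16-bit flag+length header."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    if len(utf16) < len(utf8):
        flag, payload = UTF16_FLAG, utf16
    else:
        flag, payload = 0, utf8
    if len(payload) > MAX_LEN:
        raise ValueError("payload exceeds 32,767 octets")
    return struct.pack(">H", flag | len(payload)) + payload


def unpack_string(data):
    """Read the header, then decode the payload with the indicated encoding."""
    (header,) = struct.unpack_from(">H", data)
    length = header & MAX_LEN
    codec = "utf-16-be" if header & UTF16_FLAG else "utf-8"
    return data[2:2 + length].decode(codec)
```

The ValueError branch marks the limit of the scheme: any string whose encoded form is longer than 32,767 octets cannot be represented with a 15-bit length field.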

    That would be a problem. I definitely cannot rule out long strings,
    where long is quite a bit larger than 32K.

       Elliotte Rusty Harold
       Effective XML (Addison-Wesley, 2003)           