RE: Unicode forms for internal storage

From: Mike Ayers (
Date: Tue Jan 20 2004 - 13:26:30 EST


    > Last night it occurred to me it might be possible to design an
    > internal storage format for this class which had better memory usage
    > characteristics. In particular I'd like ASCII data to occupy only a
    > single byte, and all other BMP characters from 128 to 65535 to occupy
    > only two bytes. Non-BMP characters could be stored in surrogate pairs.

            BZZZT! Sorry, thanks for playing. You can't get the advantages of
    both with no drawbacks. Given the octets 0x5B 0x5B, how would you know
    whether you had "[[" or the single Chinese character U+5B5B?
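    To make the ambiguity concrete, here is a minimal Java sketch that decodes
    the same two octets both ways. The class and method names are made up for
    illustration, and big-endian byte order is assumed for the two-byte reading.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same octets read under two assumptions.
class AmbiguousOctets {
    // Read each octet as one ASCII character.
    static String asAscii(byte[] octets) {
        return new String(octets, StandardCharsets.US_ASCII);
    }

    // Read each pair of octets as one big-endian UTF-16 code unit.
    static String asUtf16BE(byte[] octets) {
        return new String(octets, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] octets = {0x5B, 0x5B};
        String ascii = asAscii(octets);   // two characters: "[["
        String utf16 = asUtf16BE(octets); // one character: U+5B5B
        System.out.println(ascii + " has length " + ascii.length());
        System.out.printf("U+%04X has length %d%n",
                (int) utf16.charAt(0), utf16.length());
    }
}
```

    Both readings are valid, so without some out-of-band marker the decoder
    has no way to choose between them.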

    > 3. This is all completely private to one class. No data in this form
    > will be passed on the wire. None will be exposed via the public API
    > which is completely based on Java strings (that is, UTF-16).

            Good idea. We have too many external encodings anyway.

    > However, I would like the translation into and out of this format to
    > be at least as fast as the translation between UTF-8 and UTF-16 the
    > class is currently performing on every call to setValue and getValue,
    > ideally faster.

            Hmmm - again, this may be asking for too much. The UTF-8/UTF-16
    transform is pretty simple. Is it bogging you down?

    > Has anyone done any work on Unicode formats for this use-case? Does
    > anyone have any references or ideas to share?

            If your application will mostly handle European languages, or
    mostly non-European ones, then just use UTF-8 or UTF-16 respectively, as
    you won't really lose much space that way. If language usage is
    random/indeterminate/evenly distributed, then, assuming that any given
    string is primarily in a single language, a TLV-style header
    discriminating between UTF-8 and UTF-16 should do nicely. Precede each
    string with a 16-bit header: the MSB is the encoding flag (0 for UTF-8,
    1 for UTF-16), ORed with the length, in octets, of the string in the low
    15 bits (therefore a max of 32,767 octets per string, which shouldn't
    ordinarily be a problem). Then encode the string in whichever of the two
    formats is shorter. Since you have a length, you can skip the terminator.
    The resulting structure is at most one byte longer than the string would
    have been had it been encoded as straight terminated UTF-8 or UTF-16
    (the two-octet header replaces a one- or two-octet terminator), and the
    string data starts on a double octet boundary, so native UTF-16 functions
    can be used if they exist.
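    A rough Java sketch of that header scheme follows. It is one possible
    reading, not a tested design: the class name, the method names, the
    big-endian header and the choice of UTF-16BE for the payload are all
    assumptions, and it does not pad odd-length UTF-8 payloads to keep
    consecutive records aligned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: 16-bit header (flag + length) followed by the payload.
class TaggedString {
    static final int UTF16_FLAG = 0x8000; // MSB set: payload is UTF-16
    static final int MAX_LENGTH = 0x7FFF; // low 15 bits: length in octets

    // Encode with whichever of UTF-8 / UTF-16BE is shorter for this string.
    // Assumes well-formed input (no unpaired surrogates).
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        boolean useUtf16 = utf16.length < utf8.length;
        byte[] payload = useUtf16 ? utf16 : utf8;
        if (payload.length > MAX_LENGTH) {
            throw new IllegalArgumentException("string too long for 15-bit length");
        }
        int header = payload.length | (useUtf16 ? UTF16_FLAG : 0);
        return ByteBuffer.allocate(2 + payload.length) // big-endian by default
                .putShort((short) header)
                .put(payload)
                .array();
    }

    // Read the header, pick the charset from the flag, decode the payload.
    static String decode(byte[] stored) {
        int header = ByteBuffer.wrap(stored).getShort() & 0xFFFF;
        Charset charset = (header & UTF16_FLAG) != 0
                ? StandardCharsets.UTF_16BE
                : StandardCharsets.UTF_8;
        int length = header & MAX_LENGTH;
        return new String(stored, 2, length, charset);
    }

    public static void main(String[] args) {
        byte[] ascii = encode("hello");              // UTF-8 wins: 2 + 5 octets
        byte[] cjk = encode("\u65E5\u672C\u8A9E");   // UTF-16 wins: 2 + 6 octets
        System.out.println(ascii.length + " " + cjk.length);
        System.out.println(decode(ascii).equals("hello"));
    }
}
```

    Encoding the string twice just to compare lengths is wasteful; a real
    implementation would presumably count the octets in a single pass instead.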


