Unicode forms for internal storage

From: Elliotte Rusty Harold (elharo@metalab.unc.edu)
Date: Tue Jan 20 2004 - 11:59:28 EST


    I'm currently working on a project (XOM,
    <http://www.cafeconleche.org/XOM/>) in which Unicode text data accounts
    for a significant portion of memory usage in many important use cases.
    Currently, for the major class where this is an issue in practice (as
    proved by profiling), I store the data as UTF-8. This means ASCII
    data takes half the space it would in UTF-16, and many other
    characters take only the same amount as they would in UTF-16. However,
    CJK characters tend to take up 50% more space than they would in
    UTF-16 (three bytes rather than two).
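
    The size comparison above can be checked with a few lines of Java.
    This is only an illustrative sketch: the class name EncodingSizes and
    the sample strings are made up for the example and are not part of
    XOM.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XOM: compares encoded sizes of a string.
class EncodingSizes {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 accented text, and CJK samples (illustrative only).
        String[] samples = { "hello", "\u00e9t\u00e9", "\u4e2d\u6587" };
        for (String s : samples) {
            System.out.println(s.length() + " chars: UTF-8 = " + utf8Bytes(s)
                    + " bytes, UTF-16 = " + utf16Bytes(s) + " bytes");
        }
    }
}
```

    For the two-character CJK sample, UTF-8 needs six bytes against four
    for UTF-16, which is the 50% overhead mentioned above.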

    Last night it occurred to me it might be possible to design an
    internal storage format for this class which had better memory usage
    characteristics. In particular I'd like ASCII data to occupy only a
    single byte, and all other BMP characters from 128 to 65535 to occupy
    only two bytes. Non-BMP characters could be stored in surrogate pairs.

    In developing such a format I have a couple of advantages:

    1. Most C0 controls are forbidden, and will not appear in the data.
    That's already verified. If someone tries to pass in a C0 control
    other than tab, linefeed, or carriage return to setValue, an
    exception is thrown and the data is not stored. Potentially one or
    more of these characters could be used as markers in the stream.

    2. I do not need random access to parts of the data, only to whole
    strings. Unlike with UTF-8, it is not important to be able to look at
    a single byte in isolation and tell immediately which part of what
    kind of character it is.

    3. This is all completely private to one class. No data in this form
    will be passed on the wire. None will be exposed via the public API
    which is completely based on Java strings (that is, UTF-16).
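
    To make the idea concrete, here is one possible shape such a format
    could take. It is a rough sketch under stated assumptions, not a
    finished design: the class name CompactText, the choice of U+0001 as
    the marker, and the two-mode scheme are all invented for illustration.
    It relies on point 1 (U+0001 never appears in the data) and on point 2
    (strings are only ever decoded from the start). ASCII is written one
    byte per character; the marker switches to a mode where each UTF-16
    code unit, including each half of a surrogate pair, is written as two
    big-endian bytes, and the same marker switches back.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical two-mode storage format; names and marker choice are assumptions.
class CompactText {

    // A C0 control that is assumed never to occur in the stored text.
    private static final int SWITCH = 0x01;

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        boolean wide = false; // false: one byte per char; true: two bytes per code unit
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == SWITCH) {
                throw new IllegalArgumentException("U+0001 is reserved as the marker");
            }
            if (c < 0x80) {
                if (wide) {            // leave wide mode: marker as a 16-bit unit
                    out.write(0x00);
                    out.write(SWITCH);
                    wide = false;
                }
                out.write(c);
            } else {
                if (!wide) {           // enter wide mode: marker as a single byte
                    out.write(SWITCH);
                    wide = true;
                }
                out.write(c >>> 8);    // high byte of the UTF-16 code unit
                out.write(c & 0xFF);   // low byte
            }
        }
        return out.toByteArray();
    }

    // Assumes the input was produced by encode(); malformed input is not handled.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        boolean wide = false;
        int i = 0;
        while (i < data.length) {
            if (!wide) {
                int b = data[i++] & 0xFF;
                if (b == SWITCH) wide = true;
                else sb.append((char) b);
            } else {
                int unit = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
                i += 2;
                if (unit == SWITCH) wide = false;
                else sb.append((char) unit);
            }
        }
        return sb.toString();
    }
}
```

    With this sketch, pure ASCII costs one byte per character and a run of
    CJK text costs two bytes per character plus one marker byte. The
    obvious weakness is text that alternates rapidly between ASCII and
    non-ASCII characters, since every switch costs one or two extra bytes;
    a real design would want a smarter rule for when to switch.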

    However, I would like the translation into and out of this format to
    be at least as fast as the translation between UTF-8 and UTF-16 that
    the class currently performs on every call to setValue and getValue,
    and ideally faster.

    Has anyone done any work on Unicode formats for this use case? Does
    anyone have any references or ideas to share?

    -- 
       Elliotte Rusty Harold
       elharo@metalab.unc.edu
       Effective XML (Addison-Wesley, 2003)
       http://www.cafeconleche.org/books/effectivexml
       http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA
    

