Unicode forms for internal storage

From: Elliotte Rusty Harold (elharo@metalab.unc.edu)
Date: Tue Jan 20 2004 - 11:59:28 EST

Next message: Mike Ayers: "RE: Unicode forms for internal storage"

Previous message: Francois Yergeau: "RE: Unicode forms for internal storage"
Next in thread: Francois Yergeau: "RE: Unicode forms for internal storage"
Maybe reply: Francois Yergeau: "RE: Unicode forms for internal storage"
Maybe reply: Mike Ayers: "RE: Unicode forms for internal storage"
Reply: Markus Scherer: "Re: Unicode forms for internal storage"
Reply: Philippe Verdy: "Re: Unicode forms for internal storage"
Reply: Doug Ewell: "Re: Unicode forms for internal storage"
Maybe reply: Doug Ewell: "Re: Unicode forms for internal storage"
Reply: Jon Hanna: "Re: Unicode forms for internal storage"

I'm currently working on a project (XOM,
<http://www.cafeconleche.org/XOM/>) in which the Unicode text data is
a significant portion of memory usage in many important use cases.
Currently, for the major class where this is an issue in practice (as
proved by profiling), I store the data as UTF-8. This means ASCII
data takes half the space it would in UTF-16, and many other
characters (U+0080 through U+07FF) take the same two bytes they would
in UTF-16. However, CJK characters tend to take up 50% more space
(three bytes instead of two) than they would in UTF-16.
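
To make the size trade-off concrete, here is a small Java sketch that counts the encoded bytes of a string in each form. The class and method names are illustrative only and are not part of XOM.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (not part of XOM): compares encoded sizes.
class EncodedSizes {
    // Number of bytes the string occupies as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",               // ASCII
            "\u03B1\u03B2\u03B3",  // Greek: alpha, beta, gamma
            "\u65E5\u672C\u8A9E"   // CJK: three ideographs
        };
        for (String s : samples) {
            System.out.println(utf8Bytes(s) + " bytes UTF-8, "
                    + utf16Bytes(s) + " bytes UTF-16");
        }
    }
}
```

For these three samples the counts are 5 vs. 10, 6 vs. 6, and 9 vs. 6 bytes respectively.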

Last night it occurred to me it might be possible to design an
internal storage format for this class which had better memory usage
characteristics. In particular I'd like ASCII data to occupy only a
single byte, and all other BMP characters from 128 to 65535 to occupy
only two bytes. Non-BMP characters could be stored in surrogate pairs.

In developing such a format I have a few advantages:

1. Most C0 controls are forbidden, and will not appear in the data.
That's already verified. If someone tries to pass in a C0 control
other than tab, linefeed, or carriage return to setValue, an
exception is thrown and the data is not stored. Potentially one or
more of these characters could be used as markers in the stream.

2. I do not need random access to parts of the data, only to whole
strings. Unlike with UTF-8, it is not important to be able to look at
a single byte in isolation and tell immediately which part of what
kind of character it is.

3. This is all completely private to one class. No data in this form
will be passed on the wire. None will be exposed via the public API,
which is completely based on Java strings (that is, UTF-16).
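
As one possible reading of the marker idea in point 1, the following Java sketch uses the forbidden control U+0001 as a mode switch: ASCII is written one byte per character, and everything else is written as two-byte UTF-16 code units (so non-BMP characters remain surrogate pairs). The class name, the choice of U+0001, and the switching rule are all assumptions made for illustration; this is not an existing format and not XOM's implementation.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical format, for illustration only.
class CompactText {
    // Assumed marker: a C0 control that is rejected on input,
    // so it can never collide with real data.
    private static final int SHIFT = 0x01;

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        boolean wide = false; // false: 1 byte per char, true: 2 bytes per code unit
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                throw new IllegalArgumentException("forbidden C0 control at index " + i);
            }
            boolean ascii = c < 0x80;
            if (ascii && wide) {
                // Leave two-byte mode by writing the code unit 0x0001.
                out.write(0x00);
                out.write(SHIFT);
                wide = false;
            } else if (!ascii && !wide) {
                // Enter two-byte mode with a single marker byte.
                out.write(SHIFT);
                wide = true;
            }
            if (wide) {
                out.write(c >>> 8);   // high byte of the UTF-16 code unit
                out.write(c & 0xFF);  // low byte
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        boolean wide = false;
        int i = 0;
        while (i < data.length) {
            if (wide) {
                char c = (char) (((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF));
                i += 2;
                if (c == SHIFT) wide = false; else sb.append(c);
            } else {
                int b = data[i++] & 0xFF;
                if (b == SHIFT) wide = true; else sb.append((char) b);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, pure ASCII costs one byte per character and a run of CJK text costs two bytes per character plus one byte to enter two-byte mode. The weak spot is text that alternates frequently between ASCII and non-ASCII, since each return to one-byte mode costs two bytes; a real design might keep short ASCII runs in two-byte mode instead.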

However, I would like the translation into and out of this format to
be at least as fast as the translation between UTF-8 and UTF-16 the
class is currently performing on every call to setValue and getValue,
ideally faster.
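
For reference, the baseline being compared against looks roughly like the sketch below: a value stored as UTF-8 bytes and converted on every access. The class is a hypothetical stand-in written for this note, not XOM's actual code, and it omits the character validation described earlier.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a class that stores its text as UTF-8.
class Utf8Holder {
    private byte[] data = new byte[0];

    // Converts UTF-16 (the Java string) to UTF-8 on every call.
    void setValue(String value) {
        data = value.getBytes(StandardCharsets.UTF_8);
    }

    // Converts UTF-8 back to UTF-16 on every call.
    String getValue() {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Bytes currently used for the stored text.
    int storedBytes() {
        return data.length;
    }
}
```

Any replacement format would need its own encode and decode steps to be no slower than these two conversions.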

Has anyone done any work on Unicode formats for this use case? Does
anyone have any references or ideas to share?

-- 
   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA

This archive was generated by hypermail 2.1.5 : Tue Jan 20 2004 - 14:07:28 EST