RE: Unicode forms for internal storage

From: Francois Yergeau (FYergeau@alis.com)
Date: Tue Jan 20 2004 - 13:01:26 EST


    Look at SCSU (http://www.unicode.org/reports/tr6/) and BOCU-1
    (http://www.unicode.org/notes/tn6/).

    -- 
    François
    > -----Original Message-----
    > From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
    > Sent: January 20, 2004 11:59
    > To: unicode@unicode.org
    > Cc: xom-interest@lists.ibiblio.org
    > Subject: Unicode forms for internal storage
    > 
    > 
    > I'm currently working on a project (XOM, 
    > <http://www.cafeconleche.org/XOM/>) in which the Unicode text data is 
    > a significant portion of memory usage in many important use cases. 
    > Currently, for the major class where this is an issue in practice (as 
    > proved by profiling), I store the data as UTF-8. This means ASCII 
    > data takes half the space it would in UTF-16, and many other 
    > characters take only the same amount as they would in UTF-16. However, 
    > CJK characters tend to take up 50% more space than they would in 
    > UTF-16.
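
    [The size figures quoted above can be checked with a few lines of
    Java. This is only an illustrative sketch: the class name, method
    names, and sample strings are made up for the example and are not
    part of XOM.]

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not XOM code: compares encoded sizes of sample strings.
final class EncodedSizes {
    // Bytes needed to store the string as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the string as UTF-16: two per code unit, no BOM.
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",                // ASCII: 5 bytes UTF-8, 10 bytes UTF-16
            "\u00e9\u00e8\u00ea",   // accented Latin letters: 6 and 6
            "\u4e2d\u6587\u5b57"    // CJK ideographs: 9 and 6
        };
        for (String s : samples) {
            System.out.println(utf8Bytes(s) + " bytes UTF-8, "
                    + utf16Bytes(s) + " bytes UTF-16");
        }
    }
}
```

    [The CJK sample shows the 50% penalty: three bytes per character in
    UTF-8 against two in UTF-16.]
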
    > 
    > Last night it occurred to me it might be possible to design an 
    > internal storage format for this class which had better memory usage 
    > characteristics. In particular I'd like ASCII data to occupy only a 
    > single byte, and all other BMP characters from 128 to 65535 to occupy 
    > only two bytes. Non-BMP characters could be stored in surrogate pairs.
    > 
    > In developing such a format I have a couple of advantages:
    > 
    > 1. Most C0 controls are forbidden, and will not appear in the data. 
    > That's already verified. If someone tries to pass in a C0 control 
    > other than tab, linefeed, or carriage return to setValue, an 
    > exception is thrown and the data is not stored. Potentially one or 
    > more of these characters could be used as markers in the stream.
    > 
    > 2. I do not need random access to parts of the data, only to whole 
    > strings. Unlike with UTF-8, it is not important to be able to look at 
    > a single byte in isolation and tell immediately which part of what 
    > kind of character it is.
    > 
    > 3. This is all completely private to one class. No data in this form 
    > will be passed on the wire. None will be exposed via the public API 
    > which is completely based on Java strings (that is, UTF-16).
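
    [One way the three points above could combine is a mode-switching
    byte format: ASCII is stored one byte per character, everything
    else as two-byte UTF-16 code units, and two of the forbidden C0
    controls act as the mode switches. The sketch below is an
    assumption-laden illustration, not the XOM implementation and not
    SCSU or BOCU-1. The class name and the choice of U+000E and U+000F
    as markers are hypothetical, and it relies on the caller having
    already rejected C0 controls as described in point 1.]

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a private, whole-string storage format.
final class ShiftCodec {
    // Marker values picked from the C0 controls that never occur in the data.
    static final int TO_WIDE = 0x0E;    // narrow mode: switch to 2 bytes per code unit
    static final int TO_NARROW = 0x0F;  // wide mode: code unit 0x000F switches back

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        boolean wide = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == TO_WIDE || c == TO_NARROW) {
                throw new IllegalArgumentException("reserved control character");
            }
            boolean ascii = c < 0x80;
            if (ascii && wide) {
                // Leaving wide mode costs two bytes: the reserved unit 0x000F.
                out.write(0x00);
                out.write(TO_NARROW);
                wide = false;
            } else if (!ascii && !wide) {
                // Entering wide mode costs one byte.
                out.write(TO_WIDE);
                wide = true;
            }
            if (wide) {
                out.write(c >>> 8);   // high byte of the UTF-16 code unit
                out.write(c & 0xFF);  // low byte
            } else {
                out.write(c);         // plain ASCII byte
            }
        }
        return out.toByteArray();
    }

    // Assumes well-formed input produced by encode(); no validation is done.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        boolean wide = false;
        int i = 0;
        while (i < data.length) {
            if (!wide) {
                int b = data[i++] & 0xFF;
                if (b == TO_WIDE) {
                    wide = true;
                } else {
                    sb.append((char) b);
                }
            } else {
                int unit = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
                i += 2;
                if (unit == TO_NARROW) {
                    wide = false;
                } else {
                    sb.append((char) unit);
                }
            }
        }
        return sb.toString();
    }
}
```

    [With this sketch, pure ASCII text costs one byte per character and
    a run of BMP characters costs two bytes each plus a one-byte switch.
    Non-BMP characters pass through as surrogate pairs. The encoder
    switches modes eagerly, so text that alternates between ASCII and
    non-ASCII characters pays for every switch. A smarter encoder could
    stay in wide mode across short ASCII runs, which the decoder above
    already accepts.]
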
    > 
    > However, I would like the translation into and out of this format to 
    > be at least as fast as the translation between UTF-8 and UTF-16 the 
    > class is currently performing on every call to setValue and getValue, 
    > ideally faster.
    > 
    > Has anyone done any work on Unicode formats for this use-case? Does 
    > anyone have any references or ideas to share?
    > -- 
    > 
    >    Elliotte Rusty Harold
    >    elharo@metalab.unc.edu
    >    Effective XML (Addison-Wesley, 2003)
    >    http://www.cafeconleche.org/books/effectivexml
    >    
    > http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA
    

