RE: Unicode forms for internal storage

From: Francois Yergeau (FYergeau@alis.com)
Date: Tue Jan 20 2004 - 13:01:26 EST

Next message: Elliotte Rusty Harold: "Unicode forms for internal storage"

Previous message: Mike Ayers: "RE: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)"
Next in thread: Mike Ayers: "RE: Unicode forms for internal storage"

Look at SCSU (http://www.unicode.org/reports/tr6/) and BOCU-1
(http://www.unicode.org/notes/tn6/).

-- 
François
> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> Sent: January 20, 2004 11:59
> To: unicode@unicode.org
> Cc: xom-interest@lists.ibiblio.org
> Subject: Unicode forms for internal storage
> 
> 
> I'm currently working on a project (XOM, 
> <http://www.cafeconleche.org/XOM/>) in which the Unicode text data is 
> a significant portion of memory usage in many important use cases. 
> Currently, for the major class where this is an issue in practice (as 
> proved by profiling), I store the data as UTF-8. This means ASCII 
> data takes half the space it would in UTF-16, and many other 
> characters take only the same amount as they would in UTF-16. However, 
> CJK characters tend to take up 50% more space than they would in 
> UTF-16.
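
For readers who want to check the size figures quoted above, here is a minimal Java sketch. The class name EncodedSizes and the sample strings are illustrative assumptions and not part of XOM; the UTF-16 count assumes two bytes per code unit with no byte order mark.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not XOM code: compares encoded sizes of a string.
class EncodedSizes {
    // Bytes needed to store the string as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the string as UTF-16 code units (no BOM).
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        // ASCII, accented Latin, and CJK samples (assumed for illustration).
        String[] samples = { "hello", "\u00e9t\u00e9", "\u4e2d\u6587" };
        for (String s : samples) {
            System.out.println(utf8Bytes(s) + " bytes in UTF-8, "
                    + utf16Bytes(s) + " bytes in UTF-16");
        }
    }
}
```

The ASCII sample needs half the UTF-16 size in UTF-8, while the two CJK characters need six bytes in UTF-8 against four in UTF-16, which is the 50% overhead mentioned in the quoted message.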
> 
> Last night it occurred to me it might be possible to design an 
> internal storage format for this class which had better memory usage 
> characteristics. In particular I'd like ASCII data to occupy only a 
> single byte, and all other BMP characters from 128 to 65535 to occupy 
> only two bytes. Non-BMP characters could be stored in surrogate pairs.
> 
> In developing such a format I have a couple of advantages:
> 
> 1. Most C0 controls are forbidden, and will not appear in the data. 
> That's already verified. If someone tries to pass in a C0 control 
> other than tab, linefeed, or carriage return to setValue, an 
> exception is thrown and the data is not stored. Potentially one or 
> more of these characters could be used as markers in the stream.
> 
> 2. I do not need random access to parts of the data, only to whole 
> strings. Unlike with UTF-8, it is not important to be able to look at 
> a single byte in isolation and tell immediately which part of what 
> kind of character it is.
> 
> 3. This is all completely private to one class. No data in this form 
> will be passed on the wire. None will be exposed via the public API 
> which is completely based on Java strings (that is, UTF-16).
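
One way the three points above could be combined is sketched below. This is only an illustration under stated assumptions, not a format anyone in this thread has proposed or tested: the class name CompactText, the choice of U+0001 as the marker, and the two-mode design are all hypothetical. The stream starts in a one-byte-per-character mode for ASCII; the forbidden control U+0001 toggles to and from a two-bytes-per-code-unit mode, and non-BMP characters stay as surrogate pairs. The input is assumed never to contain U+0001.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical two-mode format, for illustration only.
class CompactText {
    // Assumed marker: a C0 control that is rejected before storage,
    // so it can never collide with real data.
    static final int SWITCH = 0x01;

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        boolean wide = false; // false: one byte per char; true: two bytes
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean needWide = c >= 0x80;
            if (needWide != wide) {
                // In wide mode the marker is written as the unit 0x0001.
                if (wide) out.write(0x00);
                out.write(SWITCH);
                wide = needWide;
            }
            if (wide) out.write(c >> 8); // high byte first
            out.write(c & 0xFF);
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        boolean wide = false;
        int i = 0;
        while (i < data.length) {
            int unit;
            if (wide) {
                unit = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
                i += 2;
            } else {
                unit = data[i] & 0xFF;
                i += 1;
            }
            if (unit == SWITCH) {
                wide = !wide; // marker toggles the mode, emits nothing
            } else {
                sb.append((char) unit);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, pure ASCII costs one byte per character and a run of BMP text costs two bytes per character plus a one-byte marker. Each switch into wide mode and back costs three marker bytes, so text that alternates scripts frequently could end up larger than in UTF-8; SCSU handles that case more carefully with its window mechanism.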
> 
> However, I would like the translation into and out of this format to 
> be at least as fast as the translation between UTF-8 and UTF-16 the 
> class is currently performing on every call to setValue and getValue, 
> ideally faster.
> 
> Has anyone done any work on Unicode formats for this use-case? Does 
> anyone have any references or ideas to share?
> -- 
> 
>    Elliotte Rusty Harold
>    elharo@metalab.unc.edu
>    Effective XML (Addison-Wesley, 2003)
>    http://www.cafeconleche.org/books/effectivexml
>    
> http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA


This archive was generated by hypermail 2.1.5 : Tue Jan 20 2004 - 13:53:03 EST