Re: Unicode forms for internal storage

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Jan 20 2004 - 12:52:40 EST

    You need not invent something new: Just use a simplified SCSU encoder, and either a regular SCSU
    decoder or one that only supports the features which your custom encoder uses.

    For a tiny SCSU encoder (its main function is 75 lines of commented C) that also compresses a
    little better than the scheme you describe, see http://www.mindspring.com/~markus.scherer/unicode/tr6/

    You could scale that encoder up or down to your liking.
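
    As a rough illustration of how far down such an encoder can be scaled, here is a minimal
    sketch. It is not the encoder linked above. The function name scsu_tiny_encode and its
    error convention are made up for this example. It never leaves SCSU's initial state
    (single-byte mode, dynamic window 0 at U+0080) and quotes everything outside U+0000..U+00FF
    with the SQU tag, so ASCII and Latin-1 take one byte and every other UTF-16 code unit takes
    three. Quoting each surrogate separately with SQU is an assumption of this sketch and should
    be checked against UTS #6 and the decoder you plan to use.

```c
#include <stddef.h>
#include <stdint.h>

/* SCSU single-byte-mode tag: the next two bytes are one UTF-16 code unit, MSB first. */
#define SCSU_SQU 0x0E

/*
 * Hypothetical helper for illustration only.
 * Encodes n UTF-16 code units from src into dst (capacity cap bytes).
 * Returns the number of bytes written, or SIZE_MAX if dst is too small.
 */
size_t scsu_tiny_encode(const uint16_t *src, size_t n, uint8_t *dst, size_t cap)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        uint16_t c = src[i];

        /*
         * One byte is enough when the value is not an SCSU tag byte:
         *   NUL, TAB, LF, CR and 0x20..0x7F pass through unchanged;
         *   0x80..0xFF map into the default window 0 (U+0080..U+00FF),
         *   where the byte value equals the code point.
         */
        int one_byte = (c >= 0x20 && c <= 0xFF) ||
                       c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0D;

        if (one_byte) {
            if (cap - out < 1) return SIZE_MAX;
            dst[out++] = (uint8_t)c;
        } else {
            /* Other C0 controls and everything above U+00FF: SQU hi lo. */
            if (cap - out < 3) return SIZE_MAX;
            dst[out++] = SCSU_SQU;
            dst[out++] = (uint8_t)(c >> 8);
            dst[out++] = (uint8_t)(c & 0xFF);
        }
    }
    return out;
}
```

    Text that is mostly outside Latin-1 costs three bytes per code unit with this sketch, which
    is worse than plain UTF-16. Scaling the encoder up would mean adding window switching or
    Unicode mode for such runs.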

    For a full SCSU converter, you could use ICU, for example: http://oss.software.ibm.com/icu/

    You could also use BOCU-1.

    With ICU you need not write anything new :-)
    (If you need only parts of ICU, see http://oss.software.ibm.com/icu/userguide/packaging.html)

    Best regards,
    markus

    Elliotte Rusty Harold wrote:
    > Last night it occurred to me it might be possible to design an internal
    > storage format for this class which had better memory usage
    > characteristics. In particular I'd like ASCII data to occupy only a
    > single byte, and all other BMP characters from 128 to 65535 to occupy
    > only two bytes. Non-BMP characters could be stored in surrogate pairs.


