Re: Unicode forms for internal storage

From: Markus Scherer
Date: Tue Jan 20 2004 - 12:52:40 EST


    You need not invent something new: Just use a simplified SCSU encoder, and either a regular SCSU
    decoder or one that only supports the features which your custom encoder uses.
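To make the idea concrete, here is a minimal sketch of such a simplified encoder with a matching restricted decoder. It is not the encoder referred to in this message, and the names encode_mini_scsu and decode_mini_scsu are made up for illustration. It assumes only the default SCSU start state (single-byte mode, dynamic window 0 at U+0080) and uses just four tags: SQ0, SCU, UC0 and UQU. ASCII and Latin-1 take one byte each, everything else is written as UTF-16 code units in Unicode mode, so other BMP characters take two bytes plus one tag byte per mode switch.

```python
# Sketch of a stripped-down SCSU encoder/decoder pair (hypothetical names).
# Assumption: dynamic window 0 is never redefined, so it stays at U+0080.

SQ0 = 0x01  # single-byte mode: quote one character from window 0
SCU = 0x0F  # single-byte mode: switch to Unicode mode
UC0 = 0xE0  # Unicode mode: switch to single-byte mode, dynamic window 0
UQU = 0xF0  # Unicode mode: quote the next UTF-16 code unit

# C0 controls that SCSU passes through unchanged in single-byte mode.
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be", "surrogatepass")
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]


def encode_mini_scsu(text):
    out = bytearray()
    unicode_mode = False
    for unit in utf16_units(text):
        if unit <= 0xFF:
            # ASCII and Latin-1: one byte in single-byte mode.
            if unicode_mode:
                out.append(UC0)
                unicode_mode = False
            if unit < 0x20 and unit not in PASS_THROUGH_CONTROLS:
                out.append(SQ0)  # other C0 controls collide with tags
            out.append(unit)
        else:
            # Everything else: big-endian UTF-16 code unit in Unicode mode.
            if not unicode_mode:
                out.append(SCU)
                unicode_mode = True
            high = unit >> 8
            if 0xE0 <= high <= 0xF2:
                out.append(UQU)  # high byte would look like a tag
            out += bytes((high, unit & 0xFF))
    return bytes(out)


def decode_mini_scsu(data):
    """Decode only the subset of SCSU that encode_mini_scsu emits."""
    units = []
    unicode_mode = False
    i = 0
    while i < len(data):
        b = data[i]
        if not unicode_mode:
            if b == SCU:
                unicode_mode = True
                i += 1
            elif b == SQ0:
                units.append(data[i + 1])
                i += 2
            elif b >= 0x20 or b in PASS_THROUGH_CONTROLS:
                units.append(b)
                i += 1
            else:
                raise ValueError("SCSU tag not supported by this sketch")
        elif b == UC0:
            unicode_mode = False
            i += 1
        else:
            if b == UQU:
                i += 1
            elif 0xE0 <= b <= 0xF2:
                raise ValueError("SCSU tag not supported by this sketch")
            units.append((data[i] << 8) | data[i + 1])
            i += 2
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be", "surrogatepass")
```

The output should be readable by a regular SCSU decoder, since only standard tags and the default window are used. A fuller encoder would also define dynamic windows, so that runs of, say, Cyrillic text take one byte per character.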

    For a tiny SCSU encoder (its main function is 75 lines of commented C) that also compresses a little better
    than what you describe, see

    You could scale that encoder up or down to your liking.

    For a full SCSU converter you could use ICU, for example.

    You could also use BOCU-1.

    With ICU you need not write anything new :-)
    (If you need only parts of ICU, see

    Best regards,

    Elliotte Rusty Harold wrote:
    > Last night it occurred to me it might be possible to design an internal
    > storage format for this class which had better memory usage
    > characteristics. In particular I'd like ASCII data to occupy only a
    > single byte, and all other BMP characters from 128 to 65535 to occupy
    > only two bytes. Non-BMP characters could be stored in surrogate pairs.
