Every character code in the world

From: John Cowan (jcowan@reutershealth.com)
Date: Fri Nov 15 2002 - 11:38:48 EST


    This is not a proposal to change standards in any respect. It's just a
    thought-out (well, somewhat) approach for people who have to represent
    character codes as opposed to characters, and have 32 bits to play with.

    The intent is to represent all the codes of all the registered character
    sets, present and future, as individual unsigned 31-bit integers.
    All further numbers in this post, except 94, 96, and 2022, are base 16.

    Unicode codes are mapped onto the integers 0-10FFFF in the obvious way.
    Registered ISO 2022 character sets are represented by codes from 20000000 upward.

    The detailed roadmap is as follows:

    00000000-0010FFFF: Unicode
    00110000-1FFFFFFF: reserved
    20000000-2003FFFF: ISO 2022 94-char, 96-char, C0, and C1 character sets
    20040000-2093FFFF: ISO 2022 94x94/96x96-char character sets
    20940000-5693FFFF: ISO 2022 94x94x94/96x96x96-char character sets
    56940000-7FFFFFFF: reserved
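
As a rough illustration of the roadmap above, the ranges can be written down as a lookup table. This is only a sketch: the names `ROADMAP` and `classify` are hypothetical and not part of the scheme itself.

```python
# Roadmap ranges copied from the post as (first, last, label); values are hex.
ROADMAP = [
    (0x00000000, 0x0010FFFF, "Unicode"),
    (0x00110000, 0x1FFFFFFF, "reserved"),
    (0x20000000, 0x2003FFFF, "ISO 2022 94-char, 96-char, C0, and C1"),
    (0x20040000, 0x2093FFFF, "ISO 2022 94x94/96x96-char"),
    (0x20940000, 0x5693FFFF, "ISO 2022 94x94x94/96x96x96-char"),
    (0x56940000, 0x7FFFFFFF, "reserved"),
]


def classify(code):
    """Return the roadmap label for an unsigned 31-bit code (hypothetical helper)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code must fit in 31 bits")
    for first, last, label in ROADMAP:
        if first <= code <= last:
            return label
    # Unreachable as long as the ranges stay contiguous.
    raise AssertionError("roadmap should cover every 31-bit value")
```

The ranges are contiguous, so every value from 0 to 7FFFFFFF falls into exactly one row.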

    Definitions for ISO 2022 character sets:
    Every character set has an ISO-specified value between 40 and 7E, called F.
    Some character sets have an ISO-specified value between 21 and 2F, called I.
            If I is not present, it is deemed for our purposes to be 20.
    Individual characters in one-byte character sets have a value between 20
            and 7F, called H.
    Individual characters in two-byte character sets have two values between 20
            and 7F, called H and L.
    Individual characters in three-byte character sets have three values between 20
            and 7F, called H, M, and L.

    Values:
    The value of a character in Unicode is its code value.
    The value of a character in a 94-char character set
            is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H.
    The value of a character in a 96-char character set
            is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H + 80.
    The value of a character in a 94x94-char or 96x96-char character set
            is 20040000 + (I - 20) * 90000 + (F - 40) * 2400 +
                    (H - 20) * 60 + (L - 20).
    The value of a character in a 94x94x94-char or 96x96x96-char character set
            is 20940000 + (I - 20) * 3600000 + (F - 40) * D8000 +
                    (H - 20) * 2400 + (M - 20) * 60 + (L - 20).
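
The formulas above might be sketched in code roughly as follows. The function name `iso2022_value` and its parameters are hypothetical. The sketch makes two assumptions: the terms of the two-byte formula are summed, and L is offset by 20 in the three-byte case (as it is in the two-byte case) so that each set stays inside its own D8000-sized block.

```python
def _check(name, value, lo, hi):
    """Reject a byte outside the range given in the definitions."""
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value:#x} outside {lo:#x}..{hi:#x}")


def iso2022_value(f, chars, i=0x20, is_96=False):
    """Map an ISO 2022 character to its integer code (illustrative only).

    f      final byte F, 40..7E
    chars  (H,), (H, L), or (H, M, L), each byte 20..7F
    i      intermediate byte I, 21..2F, or 20 when absent
    is_96  True for a one-byte 96-char set; ignored for multi-byte sets
    """
    _check("F", f, 0x40, 0x7E)
    _check("I", i, 0x20, 0x2F)
    for c in chars:
        _check("byte", c, 0x20, 0x7F)

    if len(chars) == 1:
        (h,) = chars
        offset = 0x80 if is_96 else 0
        return 0x20000000 + (i - 0x20) * 0x4000 + (f - 0x40) * 0x100 + h + offset
    if len(chars) == 2:
        h, l = chars
        return (0x20040000 + (i - 0x20) * 0x90000 + (f - 0x40) * 0x2400
                + (h - 0x20) * 0x60 + (l - 0x20))
    if len(chars) == 3:
        h, m, l = chars
        # Assumption: L is offset by 20 here, mirroring the two-byte formula.
        return (0x20940000 + (i - 0x20) * 0x3600000 + (f - 0x40) * 0xD8000
                + (h - 0x20) * 0x2400 + (m - 0x20) * 0x60 + (l - 0x20))
    raise ValueError("chars must hold 1, 2, or 3 bytes")
```

For example, `iso2022_value(0x42, (0x41,))` gives 20000241, and `iso2022_value(0x42, (0x30, 0x21))` gives 20044E01. The multipliers 60, 2400, and D8000 are 96, 96x96, and 96x96x96 in decimal, which is why 94-char and 96-char sets can share one layout.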

    This scheme was inspired by a related scheme by Markus Kuhn.

    -- 
    John Cowan    http://www.ccil.org/~cowan   <jcowan@reutershealth.com>
        "Any legal document draws most of its meaning from context.  A telegram
        that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
        5-bit Baudot code plus appropriate headers) is as good a legal document
        as any, even sans digital signature." --me
    

