Internal Representation of Unicode

From: myrkraverk@users.sourceforge.net
Date: Thu Sep 25 2003 - 20:53:07 EDT


    Hi,

    In a plain-text environment, there is often a need to encode more than
    just the character itself. A console, or terminal emulator, is such an
    environment. I therefore propose the following as a technical report
    on the internal encoding of Unicode characters, with one goal in mind:
    character equality is binary equality.

    Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
    EMUE.

    I thought of dividing the 64-bit code space into 32 variably wide
    planes: one for control characters, one for Latin characters, one for
    Han characters, and so on. The plane number takes the top 5 bits, and
    the next 3 bits are fixed to zero (for future expansion and alignment
    to an octet).
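
    As a rough illustration of the plane prefix described above, here is
    a minimal Python sketch. The names plane_of and with_plane are made
    up for this example and are not part of the proposal; the only things
    taken from the text are the 5-bit plane field in bits 63..59 and the
    three zero bits in 58..56.

```python
PLANE_SHIFT = 59   # plane field occupies bits 63..59
PLANE_MASK = 0x1F  # 5 bits, so 32 planes
ZERO_SHIFT = 56    # bits 58..56 are fixed to zero
ZERO_MASK = 0x7


def with_plane(plane: int, payload: int) -> int:
    """Combine a plane number with a 56-bit plane-specific payload."""
    if not 0 <= plane <= PLANE_MASK:
        raise ValueError("plane must fit in 5 bits")
    if not 0 <= payload < (1 << ZERO_SHIFT):
        raise ValueError("payload must fit in 56 bits")
    return (plane << PLANE_SHIFT) | payload


def plane_of(word: int) -> int:
    """Return the plane number stored in the top 5 bits of a 64-bit word."""
    if not 0 <= word < (1 << 64):
        raise ValueError("word must fit in 64 bits")
    if (word >> ZERO_SHIFT) & ZERO_MASK:
        raise ValueError("bits 58..56 must be zero")
    return (word >> PLANE_SHIFT) & PLANE_MASK
```

    With this, plane_of(with_plane(1, 0x41)) gives back 1, and a word
    with any of bits 58..56 set is rejected rather than silently
    accepted.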

    I reserve plane 0 for control characters and won't discuss it further.

    Plane 1 I had intended for Latin characters, with the following
    encoding method in mind:

    bits 63..59  58..56  55..40  39..32  31..24  23..16  15..8   7..0
        +-------+-------+-------+-------+-------+-------+-------+-------+
        | plane | zero  | attr  |  res  | uacc  | lacc  |  res  | char  |
        +-------+-------+-------+-------+-------+-------+-------+-------+

    * plane: Plane number (5 bits)
    * zero:  Zero bits (3 bits)
    * attr:  Attributes (16 bits)
    * res:   Reserved (8 bits)
    * uacc:  Upper accent (8 bits)
    * lacc:  Lower accent (8 bits)
    * res:   Reserved (8 bits)
    * char:  Character (8 bits)

    All of these fields are implementation defined, with just one rule
    for char: it must not hold characters that can be made by combining a
    base character with accents; that's what the accent fields are for.
    Each 8-bit accent field allows for 255 accents (keeping zero for "no
    accent"), so 255 upper and 255 lower accents, which should be enough
    -- for now.
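
    To make the Latin layout concrete, here is a minimal Python sketch of
    packing one character into a 64-bit word. Everything named here is an
    assumption for illustration: the accent code tables (the proposal
    leaves them implementation defined), the helper names, and the use of
    Unicode NFD decomposition to split a character into base and accents.
    The point is only that a precomposed character and its combining
    sequence end up as the same integer.

```python
import unicodedata

LATIN_PLANE = 1  # plane 1 is the Latin plane in the proposal

# Hypothetical accent codes; the proposal leaves these implementation defined.
UPPER_ACCENTS = {"\u0301": 1, "\u0300": 2, "\u0308": 3}  # acute, grave, diaeresis
LOWER_ACCENTS = {"\u0327": 1, "\u0323": 2}               # cedilla, dot below


def pack_latin(char: int, uacc: int = 0, lacc: int = 0, attr: int = 0) -> int:
    """Pack the fields into the 64-bit Latin layout (reserved bits stay 0)."""
    fields = (("char", char, 8), ("uacc", uacc, 8),
              ("lacc", lacc, 8), ("attr", attr, 16))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (LATIN_PLANE << 59) | (attr << 40) | (uacc << 24) | (lacc << 16) | char


def unpack_latin(word: int) -> dict:
    """Split a Latin-plane word back into its named fields."""
    return {
        "plane": (word >> 59) & 0x1F,
        "attr": (word >> 40) & 0xFFFF,
        "uacc": (word >> 24) & 0xFF,
        "lacc": (word >> 16) & 0xFF,
        "char": word & 0xFF,
    }


def encode_latin(text: str, attr: int = 0) -> int:
    """Encode one base character plus optional accents as a single word."""
    base, *marks = unicodedata.normalize("NFD", text)
    uacc = lacc = 0
    for mark in marks:
        if mark in UPPER_ACCENTS and uacc == 0:
            uacc = UPPER_ACCENTS[mark]
        elif mark in LOWER_ACCENTS and lacc == 0:
            lacc = LOWER_ACCENTS[mark]
        else:
            raise ValueError(f"cannot place accent U+{ord(mark):04X}")
    return pack_latin(ord(base), uacc, lacc, attr)
```

    Under these assumed tables, encode_latin("\u00e9") and
    encode_latin("e\u0301") produce the same word, so comparing two
    characters is a plain integer comparison. Note that a non-zero attr
    also takes part in that comparison.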

    For Han characters I thought of the following encoding method (with no
    particular plane in mind):

    bits 63..59  58..56  55..40  39..32  31..0
        +-------+-------+-------+-------+--------------------------+
        | plane | zero  | attr  | style |           char           |
        +-------+-------+-------+-------+--------------------------+

    * plane: Plane number (5 bits)
    * zero:  Zero bits (3 bits)
    * attr:  Attributes (16 bits)
    * style: Stylistic variation (8 bits)
    * char:  Character (32 bits)
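
    The Han layout can be sketched the same way. In this minimal Python
    example the plane number 2 is purely a placeholder (no plane is
    assigned above), the helper names are invented, and using the Unicode
    code point as the char value is just one possible choice for an
    implementation-defined field.

```python
HAN_PLANE = 2  # placeholder only; the proposal assigns no plane to Han


def pack_han(char: int, style: int = 0, attr: int = 0,
             plane: int = HAN_PLANE) -> int:
    """Pack the fields into the 64-bit Han layout."""
    fields = (("char", char, 32), ("style", style, 8),
              ("attr", attr, 16), ("plane", plane, 5))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (plane << 59) | (attr << 40) | (style << 32) | char


def unpack_han(word: int) -> dict:
    """Split a Han-layout word back into its named fields."""
    return {
        "plane": (word >> 59) & 0x1F,
        "attr": (word >> 40) & 0xFFFF,
        "style": (word >> 32) & 0xFF,
        "char": word & 0xFFFFFFFF,
    }
```

    For example, pack_han(0x4E2D, style=1) places the character value in
    the low 32 bits and the style selector in bits 39..32, and unpack_han
    recovers both.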

    Again, all fields are implementation defined. How to tell something
    like a terminal emulator which stylistic variation to use is outside
    the scope of this email, though I suspect language tags could be used;
    for attributes, there are standardized escape sequences.

    I was also thinking of a plane for punctuation and symbolic characters.

    I will be pleased if anyone can come up with better encoding methods
    than I did, and I call upon other people to come up with encodings for
    scripts I know nothing about, such as Arabic and others. Then let's
    wrap it up in a technical report and be done with it ;)

    Any comments?

    Johann

    -- 
    Sometimes I do not think at all!  Does that mean I don't exist
    in the mean time?
    

