Internal Representation of Unicode

From: myrkraverk@users.sourceforge.net
Date: Thu Sep 25 2003 - 20:53:07 EDT

Next message: Markus Scherer: "Re: Unicode Normalisaton Optimisation Experiments"

Previous message: Rick Cameron: "RE: Web Form: Other Question: Unicode characters in Form in MSAcc ess"
Next in thread: John Cowan: "Re: Internal Representation of Unicode"
Reply: John Cowan: "Re: Internal Representation of Unicode"
Maybe reply: jameskass@att.net: "Re: Internal Representation of Unicode"
Maybe reply: Marco Cimarosti: "RE: Internal Representation of Unicode"
Maybe reply: Rick McGowan: "Re: Internal Representation of Unicode"
Maybe reply: Jill Ramonsky: "RE: Internal Representation of Unicode"

Hi,

In a plain text environment, there is often a need to encode more than
just the plain character. A console, or terminal emulator, is such an
environment. Therefore I propose the following as a technical report
for the internal encoding of Unicode characters, with one goal in mind:
character equality is binary equality.

Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
EMUE.

I thought of dividing the 64-bit code space into 32 variably wide
planes: one for control characters, one for Latin characters, one for
Han characters, and so on. The plane number takes 5 bits, and the next
3 bits are fixed to zero (for future expansion and alignment to an octet).
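As a rough sketch of the header just described: the proposal does not say where in the 64 bits the plane field sits, so the code below assumes it occupies the most significant bits, with the three zero bits right after it. The names make_header, plane_of and zero_bits_ok are made up for illustration.

```python
PLANE_BITS = 5   # 2**5 = 32 planes
ZERO_BITS = 3    # fixed to zero; pads the plane field to a full octet
CELL_BITS = 64

# Assumption: the plane field is the most significant part of the cell.
PLANE_SHIFT = CELL_BITS - PLANE_BITS   # 59
ZERO_SHIFT = PLANE_SHIFT - ZERO_BITS   # 56


def make_header(plane: int) -> int:
    """Return a 64-bit value with only the plane field set."""
    if not 0 <= plane < (1 << PLANE_BITS):
        raise ValueError("plane must fit in 5 bits")
    return plane << PLANE_SHIFT


def plane_of(cell: int) -> int:
    """Extract the 5-bit plane number from a 64-bit cell."""
    return (cell >> PLANE_SHIFT) & ((1 << PLANE_BITS) - 1)


def zero_bits_ok(cell: int) -> bool:
    """Check that the three bits after the plane field are zero."""
    return (cell >> ZERO_SHIFT) & ((1 << ZERO_BITS) - 1) == 0
```

Under this assumption the plane number and the zero bits together fill the top octet, so plane 1 starts at 0x0800000000000000.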

I call plane 0 control characters and won't discuss it further.

Plane 1 I had intended for Latin characters, with the following
encoding method in mind:

* Plane  Plane (5 bits)
* Zero   Zero bits (3 bits)
* Attr   Attributes (16 bits)
* Res    Reserved (8 bits)
* Uacc   Upper Accent (8 bits)
* Lacc   Lower Accent (8 bits)
* Res    Reserved (8 bits)
* Char   Character (8 bits)

All of these fields are actually implementation defined, with just one
rule for Char: don't include characters that can be made with
combinations; that's what the accent fields are for. This allows for
255 upper and 255 lower accents, which should be enough -- for now.
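A minimal sketch of the Latin layout, to show how the accent fields give binary equality. Everything concrete here is an assumption, since the fields are implementation defined: the fields are packed from the most significant bit down in the order listed, accent code 0 means "no accent", the accent tables are invented, and the Char field simply holds the code of the base letter. Unicode NFD decomposition is borrowed only as a convenient way to split a precomposed letter into base plus accents.

```python
import unicodedata

LATIN_PLANE = 1  # plane 1 is the Latin plane in the proposal

# Hypothetical accent codes; the proposal leaves these implementation defined.
# Assumption: 0 means "no accent", leaving 255 usable codes per field.
UPPER_ACCENTS = {"\u0301": 1, "\u0300": 2, "\u0308": 3}  # acute, grave, diaeresis
LOWER_ACCENTS = {"\u0327": 1, "\u0323": 2}               # cedilla, dot below


def encode_latin(text: str, attr: int = 0) -> int:
    """Encode one letter with optional accents as a 64-bit integer."""
    if not 0 <= attr <= 0xFFFF:
        raise ValueError("attr must fit in 16 bits")
    # NFD splits a precomposed letter into a base letter plus combining marks.
    base, *marks = unicodedata.normalize("NFD", text)
    if ord(base) > 0xFF:
        raise ValueError("base character must fit in 8 bits")
    uacc = lacc = 0
    for mark in marks:
        if mark in UPPER_ACCENTS:
            uacc = UPPER_ACCENTS[mark]
        elif mark in LOWER_ACCENTS:
            lacc = LOWER_ACCENTS[mark]
        else:
            raise ValueError(f"no accent code for U+{ord(mark):04X}")
    # High to low: Plane(5) Zero(3) Attr(16) Res(8) Uacc(8) Lacc(8) Res(8) Char(8)
    return (LATIN_PLANE << 59) | (attr << 40) | (uacc << 24) | (lacc << 16) | ord(base)
```

With these tables, the precomposed "é" (U+00E9) and "e" followed by a combining acute (U+0301) both encode to 0x0800000001000065, so comparing the two integers is enough to compare the characters.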

For Han characters I thought of the following encoding method (with no
particular plane in mind):

* Plane  Plane (5 bits)
* Zero   Zero bits (3 bits)
* Attr   Attributes (16 bits)
* Style  Stylistic Variation (8 bits)
* Char   Character (32 bits)

Again, all fields are implementation defined. How to tell something
like a terminal emulator which stylistic variation to use is outside
the scope of this email, though I suspect language tags can be used.
For attributes, there are standardized escape sequences.
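The Han layout can be sketched the same way. No plane number is given for Han, so HAN_PLANE = 2 below is purely hypothetical, as are the function names; the fields are again assumed to be packed from the most significant bit down, and the 32-bit Char field is assumed to hold a Unicode code point.

```python
HAN_PLANE = 2  # hypothetical; the proposal names no particular plane for Han


def pack_han(char: int, style: int = 0, attr: int = 0, plane: int = HAN_PLANE) -> int:
    """Pack the Han fields into one 64-bit integer."""
    if not 0 <= plane < 32:
        raise ValueError("plane must fit in 5 bits")
    if not 0 <= attr <= 0xFFFF:
        raise ValueError("attr must fit in 16 bits")
    if not 0 <= style <= 0xFF:
        raise ValueError("style must fit in 8 bits")
    if not 0 <= char <= 0xFFFFFFFF:
        raise ValueError("char must fit in 32 bits")
    # High to low: Plane(5) Zero(3) Attr(16) Style(8) Char(32)
    return (plane << 59) | (attr << 40) | (style << 32) | char


def unpack_han(cell: int) -> tuple:
    """Return (plane, attr, style, char) from a 64-bit Han cell."""
    plane = (cell >> 59) & 0x1F
    attr = (cell >> 40) & 0xFFFF
    style = (cell >> 32) & 0xFF
    char = cell & 0xFFFFFFFF
    return plane, attr, style, char
```

For example, pack_han(0x4E2D, style=1, attr=1) gives 0x1000010100004E2D, and unpack_han returns the four fields unchanged.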

I was also thinking of a plane for punctuation and symbolic characters.

I will be pleased if anyone can come up with better encoding methods
than I did, and I call upon other people to come up with encodings for
scripts I know nothing about, such as Arabic and others. Then let's
wrap it up in a technical report and be done with it ;)

Any comments?

Johann

-- 
Sometimes I do not think at all!  Does that mean I don't exist
in the mean time?

