Encoding for Fun (was Line Separator)

From: Jill Ramonsky (Jill.Ramonsky@aculab.com)
Date: Wed Oct 22 2003 - 05:03:17 CST


> -----Original Message-----
> From: Doug Ewell [mailto:dewell@adelphia.net]
> Sent: Wednesday, October 22, 2003 6:19 AM
> To: Unicode Mailing List
> Cc: Marco Cimarosti; Jill Ramonsky
> Subject: Re: Line Separator and Paragraph Separator
> Importance: Low
>
>
> Jill, I'd be interested in details of your invented
> encodings, just for
> fun. Please e-mail privately to avoid incurring the wrath of
> group (b).
>

I'm going to risk the wrath of the group because I hereby place this in
the public domain. Now you can't patent it! :-)
Unicode list, please note, I used this a few years back /internally/,
within one particular piece of software. It was never intended for wider
use ... and that's the case for the defence, m'lud!

The only invented encoding which got any real use was the following
(currently nameless) one:

We define an 8X byte as a byte with bit pattern 1000xxxx
We define a 9X byte as a byte with bit pattern 1001xxxx

The rules are:
(1) If the codepoint is in the range U+0000 to U+007F, represent it as
a single byte (that covers ASCII)
(2) If the codepoint is in the range U+00A0 to U+00FF, also represent
it as a single byte (that covers Latin-1, minus the C1 controls)
(3) In all other cases, represent the codepoint as a sequence of one or
more 8X bytes followed by a single 9X byte.

A sequence of N 8X bytes plus one 9X byte therefore carries 4(N+1)
bits of "payload", which are then interpreted literally as a Unicode
codepoint.

EXAMPLES:
U+002A ('*') would be represented as 2A (all Latin-1 chars are left
unchanged apart from the C1s).
U+0085 (NEL) would be represented as 88 95 (just to prove that we
haven't lost the C1 controls altogether!)
U+20AC (Euro sign) would be represented as 82 80 8A 9C.

As you can see, the hex value of the codepoint is actually "readable"
straight from the encoded bytes: just look at the second nibble of each
8X or 9X byte.
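Decoding is just that nibble-reading made mechanical: accumulate the
second nibble of each 8X byte and finish the codepoint at the 9X byte.
A matching sketch (same hypothetical naming as above):

    def decode_latin1_plus(data):
        chars, cp = [], 0
        for b in data:
            if 0x80 <= b <= 0x8F:           # 8X byte: shift in one nibble
                cp = (cp << 4) | (b & 0xF)
            elif 0x90 <= b <= 0x9F:         # 9X byte: final nibble, emit char
                chars.append(chr((cp << 4) | (b & 0xF)))
                cp = 0
            else:                           # plain ASCII / Latin-1 byte
                chars.append(chr(b))
        return ''.join(chars)

    decode_latin1_plus(bytes.fromhex('82808a9c'))   # -> '\u20ac', the Euro sign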

Another interesting feature: starting from a random point in a string,
it is easy to scan backwards or forwards to find the start-byte or
end-byte of a character. This is valuable, as it means that you don't
have to parse a string from the beginning in order not to get lost.
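In code, that resynchronisation is just a couple of loops, assuming
well-formed input (helper names are mine again):

    def char_start(data, i):
        # Back up to the first byte of the character covering position i.
        if not (0x80 <= data[i] <= 0x9F):
            return i                        # single-byte character
        while i > 0 and 0x80 <= data[i - 1] <= 0x8F:
            i -= 1                          # 8X bytes only occur mid-sequence
        return i

    def char_end(data, i):
        # Run forward to the last byte of the character covering position i.
        while 0x80 <= data[i] <= 0x8F:
            i += 1                          # skip 8X bytes until the 9X terminator
        return i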

Finally, of course, the big plus is that it "looks like ASCII". Although
this was used for "internal use only", it is interesting to speculate
how it might have been declared, had it been a published encoding.
Because, you see, it is quite interpretable by any engine which
understands only Latin-1. The worst outcome is that any 8X...9X
sequences will be incorrectly displayed as multiple unknown-character
glyphs ... but that is not /much/ worse than displaying a single
unknown-character glyph. On the other hand, if you declare it as
"LATIN-1-PLUS" or something, then any application which does not
recognise that encoding name will be forced to interpret the stream as
7-bit ASCII, thereby replacing all codepoints above U+007F with '?' or
something. Which behaviour is preferable, I wonder? What we'd really
/want/ the encoding name to say is "interpret as LATIN-1-PLUS if you
can, otherwise interpret as LATIN-1", but there doesn't seem to be any
way of saying that with current encoding nomenclature.

Jill


