Re: Encoding for Fun (was Line Separator)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 22 2003 - 06:09:24 CST


From: Jill Ramonsky
> From: Doug Ewell [mailto:dewell@adelphia.net]
> >
> > Jill, I'd be interested in details of your invented
> > encodings, just for
> > fun. Please e-mail privately to avoid incurring the wrath of
> > group (b).
>
> I'm going to risk the wrath of the group because I hereby
> place this in the public domain. Now you can't patent it! :-)
> Unicode list, please note, I used this a few years back
> internally, within one particular piece of software.
> It was never intended for wider use ... and that's the case
> for the defence, m'lud!
>
> The only invented encoding which got any real use was the
> following (currently nameless) one:
>
> We define an 8X byte as a byte with bit pattern 1000xxxx
> We define a 9X byte as a byte with bit pattern 1001xxxx
>
> The rules are:
> (1) If the codepoint is in the range U+00 to U+7F, represent
> it as a single byte (that covers ASCII)
> (2) If the codepoint is in the range U+A0 to U+FF, also represent
> it as a single byte (that covers Latin-1, minus the C1 controls)
> (3) In all other cases, represent the codepoint as a sequence of
> one or more 8X bytes followed by a single 9X byte.
> A sequence of N 8X bytes plus one 9X byte therefore contains
> 4(N+1) bits of "payload", which are then interpreted literally as
> a Unicode codepoint.
>
> EXAMPLES:
> U+2A ('*') would be represented as 2A (all Latin-1 chars are left
> unchanged apart from the C1s).
> U+85 (NEL) would be represented as 88 95 (just to prove that we
> haven't lost the C1 controls altogether!)
> U+20AC (Euro sign) would be represented as 82 80 8A 9C
>
> As you can see, the hex value of the encoded codepoint is actually
> "readable" from the hex, if you just look at the second nibble of
> each 8X or 9X byte.
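
The three rules and the examples quoted above can be sketched in Python
as follows. This is only an illustration, not Jill's code: the names
encode_codepoint and encode_text are made up here, and the sketch assumes
that leading zero nibbles are dropped, which is what the U+85 and U+20AC
examples suggest.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one code point following the quoted rules (1) to (3)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Rules (1) and (2): ASCII and Latin-1 minus the C1 controls stay
    # as a single, unchanged byte.
    if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
        return bytes([cp])
    # Rule (3): split the value into 4-bit nibbles, most significant first.
    # Assumption: no leading zero nibbles are emitted.
    nibbles = []
    while cp:
        nibbles.append(cp & 0xF)
        cp >>= 4
    nibbles.reverse()
    # Every nibble but the last goes into an 8X byte, the last into a 9X byte.
    return bytes([0x80 | n for n in nibbles[:-1]] + [0x90 | nibbles[-1]])


def encode_text(text: str) -> bytes:
    """Encode a whole string, one code point at a time."""
    return b"".join(encode_codepoint(ord(ch)) for ch in text)


if __name__ == "__main__":
    for cp in (0x2A, 0x85, 0x20AC):
        print(f"U+{cp:04X} ->", encode_codepoint(cp).hex(" ").upper())
```

Any code point that reaches rule (3) is at least 0x80, so it always has
at least two nibbles and the "one or more 8X bytes" requirement holds
without a special case.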

That's quite a simple encoding. At least it has the merit of not being
restricted in encoding length. But this may also be a security issue
in systems that implement it, as there's no limit on the
number of bytes to scan forward or backward to get the whole
sequence, unless you specify that there can be no more than
five 8X bytes, as the longest valid sequence would be
{0x81, 0x80, 0x8F, 0x8F, 0x8F, 0x9D}=U+10FFFD.
However, UTF-8 is more compact for anything beyond Latin-1: U+20AC takes
three bytes in UTF-8 but four here, and U+10FFFD takes four but six here.
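
To make the length limit concrete, here is a rough Python sketch of a
decoder that enforces the cap of five 8X bytes mentioned above. The names
decode and MAX_LEAD are hypothetical, and the error handling (raising
ValueError) is just one possible choice, not something specified in the
original scheme.

```python
MAX_LEAD = 5  # assumed cap: at most five 8X bytes before the closing 9X byte


def decode(data: bytes) -> str:
    """Decode 8X/9X sequences, refusing runs longer than MAX_LEAD."""
    out = []
    acc = 0    # payload bits collected from the current run of 8X bytes
    lead = 0   # number of 8X bytes seen in the current sequence
    for b in data:
        hi, lo = b & 0xF0, b & 0x0F
        if hi == 0x80:
            lead += 1
            if lead > MAX_LEAD:
                raise ValueError("more than five 8X bytes in one sequence")
            acc = (acc << 4) | lo
        elif hi == 0x90:
            if lead == 0:
                raise ValueError("9X byte without a preceding 8X byte")
            cp = (acc << 4) | lo
            if cp > 0x10FFFF:
                raise ValueError("payload beyond U+10FFFF")
            out.append(chr(cp))
            acc = lead = 0
        else:
            if lead:
                raise ValueError("8X run not terminated by a 9X byte")
            out.append(chr(b))  # ASCII or Latin-1 byte, taken as is
    if lead:
        raise ValueError("truncated sequence at end of input")
    return "".join(out)
```

With the cap in place a decoder never has to look at more than six bytes
to resolve one character. The sketch does not reject sequences with
leading zero nibbles, or sequences that spell out a code point which
should have been a single byte; a stricter decoder would probably want
to refuse those as well.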

The second merit is that the technique can be used on top of all
ISO-8859-* charsets, by replacing the C1 controls mapped in the
0x8X and 0x9X positions.
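
As a rough illustration of that layering, the Python sketch below keeps
any character that the base charset encodes as a single byte outside
0x80-0x9F, and falls back to the 8X/9X form for everything else. The
function name encode_over and the default choice of ISO-8859-15 are
assumptions made for this example only.

```python
def encode_over(text: str, base: str = "iso8859_15") -> bytes:
    """Layer the 8X/9X scheme on top of an ISO-8859-* charset (sketch)."""
    out = bytearray()
    for ch in text:
        try:
            b = ch.encode(base)
        except UnicodeEncodeError:
            b = b""  # not in the base charset at all
        # Keep the base charset's byte unless it falls in the C1 range,
        # which this scheme takes over for its 8X and 9X bytes.
        if len(b) == 1 and not 0x80 <= b[0] <= 0x9F:
            out += b
            continue
        # Otherwise spell out the code point, one hex digit per byte.
        digits = f"{ord(ch):X}"
        out += bytes(0x80 | int(d, 16) for d in digits[:-1])
        out.append(0x90 | int(digits[-1], 16))
    return bytes(out)
```

Over ISO-8859-15, for instance, the Euro sign stays at its single byte
0xA4, while U+00A4 (the currency sign that ISO-8859-15 dropped) has to be
spelled out as 8A 94. A matching decoder must of course know which base
charset was used.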

It could as well be mapped over EBCDIC, using the mapping
between standard ISO Latin 1 and EBCDIC Latin 1, but there's
a problem caused by the legacy and widely used NEL control, which
an EBCDIC transcoding maps to the C1 position 0x85 that this
encoding reuses:

You can't then say that it is fully compatible with ISO-8859-1,
as it breaks the reversible compatibility with an EBCDIC
transcoding (unless you are sure that no internal system or
protocol will transcode your text files to/from EBCDIC). But one
could argue that 8-bit JIS and EUC do not offer this
reversibility of encodings for C1 controls either, except through
ISO 2022 code-page switches and escaping mechanisms which
allow a reversible conversion between 8-bit and 7-bit encodings
(through the <SS2>, <SI> and <SO> controls and escape sequences).


