Re: Encoding for Fun (was Line Separator)

From: jon@hackcraft.net
Date: Wed Oct 22 2003 - 06:53:30 CST


> The only invented encoding which got any real use was the following
> (currently nameless) one:
>
> We define an 8X byte as a byte with bit pattern 1000xxxx
> We define a 9X byte as a byte with bit pattern 1001xxxx
>
> The rules are:
> (1) If the codepoint is in the range U+00 to U+7F, represent it as a
> single byte (that covers ASCII)
> (2) If the codepoint is in the range U+A0 to U+FF, also represent it as
> a single byte (that covers Latin-1, minus the C1 controls)
> (3) In all other cases, represent the codepoint as a sequence of one or
> more 8X bytes followed by a single 9X byte.
>
> A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1)
> bits of "payload", which are then interpreted literally as a Unicode
> codepoint.
>
> EXAMPLES:
> U+2A ('*') would be represented as 2A (all Latin-1 chars are left
> unchanged apart from the C1s).
> U+85 (NEL) would be represented as 88 95 (just to prove that we haven't
> lost the C1 controls altogether!)
> U+20AC (Euro sign) would be represented as 82 80 8A 9C
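
For anyone who wants to play with the scheme quoted above, here is a minimal Python sketch of an encoder. The names encode_codepoint and encode_text are mine, not part of the original description, and I am assuming the encoder always emits the shortest form, which the quoted rules do not actually state.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one code point under the quoted rules (shortest form assumed)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Rules (1) and (2): U+00..U+7F and U+A0..U+FF are a single byte.
    if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
        return bytes([cp])
    # Rule (3): split the value into 4-bit nibbles, most significant first.
    # Anything reaching this point is U+80..U+9F or above U+FF, so it
    # always has at least two nibbles (one 8X byte plus the 9X byte).
    nibbles = []
    while cp:
        nibbles.append(cp & 0xF)
        cp >>= 4
    nibbles.reverse()
    # All nibbles but the last go into 8X bytes; the last goes into a 9X byte.
    return bytes([0x80 | n for n in nibbles[:-1]] + [0x90 | nibbles[-1]])


def encode_text(text: str) -> bytes:
    """Encode a whole string, one code point at a time."""
    return b"".join(encode_codepoint(ord(ch)) for ch in text)
```

Calling encode_codepoint(0x20AC).hex() gives "82808a9c", which matches the Euro sign example in the quoted text.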

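If you used this for interchange between components, there would be a
potential security issue if decoders accepted "over-long" encodings, such as
U+002F ('/') encoded as 0x82 0x9F instead of the single byte 0x2F. A component
that checks the raw bytes for 0x2F would not see the character, while a
component further along that decodes the bytes would.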
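
As a rough illustration of closing that hole, here is a Python sketch of a decoder that rejects over-long forms outright. The name decode_strict and the exact set of checks are my own assumptions; the original description does not say how a decoder should treat such input.

```python
def decode_strict(data: bytes) -> str:
    """Decode 8X/9X-encoded bytes, rejecting any over-long sequence."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Single-byte forms: U+00..U+7F and U+A0..U+FF.
        if b <= 0x7F or b >= 0xA0:
            out.append(chr(b))
            i += 1
            continue
        first = b   # first byte of the multi-byte sequence
        cp = 0
        count = 0   # number of 8X bytes consumed
        while i < len(data) and 0x80 <= data[i] <= 0x8F:
            cp = (cp << 4) | (data[i] & 0x0F)
            count += 1
            i += 1
            if count > 5:
                raise ValueError("sequence too long for any code point")
        if i >= len(data) or not 0x90 <= data[i] <= 0x9F:
            raise ValueError("8X run not terminated by a 9X byte")
        cp = (cp << 4) | (data[i] & 0x0F)
        i += 1
        if count == 0:
            raise ValueError("9X byte without a preceding 8X byte")
        # Over-long: a leading zero nibble, or a value that should have
        # been written as a single byte.
        if first == 0x80 or cp <= 0x7F or 0xA0 <= cp <= 0xFF:
            raise ValueError("over-long encoding")
        if cp > 0x10FFFF:
            raise ValueError("value beyond U+10FFFF")
        out.append(chr(cp))
    return "".join(out)
```

With this, the bytes 0x82 0x9F raise ValueError rather than quietly decoding to '/', and so does 0x80 0x88 0x95, a padded form of U+85.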

Beyond that, of course, one can use whatever encodings one wants privately.


