Re: Encoding for Fun (was Line Separator)

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Oct 23 2003 - 17:22:57 CST

Next message: Mark Davis: "Re: [OT] RE: GDP by language"
Previous message: Markus Scherer: "Re: FW: Web Form: Other Question, Problem, or Feedback"
Maybe in reply to: Jill Ramonsky: "Encoding for Fun (was Line Separator)"

Arcane Jill wrote:

> I'm going to risk the wrath of the group because I hereby place this
> in the public domain. Now you can't patent it! :-)

But I can implement it. <grin />

> The only invented encoding which got any real use was the following
> (currently nameless) one:

Jill referred to it later in her message as "Latin-1-Plus," but I think
Frank da Cruz might consider it a "minus" instead because of its
non-2022-conformant use of C1 code points.

I like John Cowan's suggestion of "UTF-4." Sure, it's not an official
UTF, but neither are UTF-5, -6, -7d5, -9, -17, or -64, all names that
have been used in the past for these unofficial (and sometimes
facetious) encodings.

In fact, given the Unicode 4.0 definition (D29) that a "Unicode
Transformation Format" is a CEF or CES, not a transfer encoding syntax,
I'd say Jill's format is more of a UTF than UTF-7 is.

> We define an 8X byte as a byte with bit pattern 1000xxxx
> We define a 9X byte as a byte with bit pattern 1001xxxx
>
> The rules are:
> (1) If the codepoint is in the range U+00 to U+7F, represent it as a
> single byte (that covers ASCII)
> (2) If the codepoint is in the range U+A0 to U+FF, also represent it
> as a single byte (that covers Latin-1, minus the C1 controls)
> (3) In all other cases, represent the codepoint as a sequence of one
> or more 8X bytes followed by a single 9X byte.
>
> A sequence of N 8X bytes plus one 9X byte therefore contains 4(N+1)
> bits of "payload", which are then interpreted literally as a Unicode
> codepoint.
>
> EXAMPLES:
> U+2A ('*') would be represented as 2A (all Latin-1 chars are left
> unchanged apart from the C1s).
> U+85 (NEL) would be represented as 88 95 (just to prove that we
> haven't lost the C1 controls altogether!)
> U+20AC (Euro sign) would be represented as 82 80 8A 9C

I note, for the authors of UTN #6, that Jill's plain-English description
is an *excellent* example of what needs to be written for BOCU-1, if
that encoding is ever going to get off the ground.
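
To show how directly the quoted rules translate into code, here is a
minimal Python sketch of an encoder. The name utf4_encode is made up for
illustration, and the sketch assumes that rule (3) uses the fewest
nibbles that hold the code point (no leading zero nibbles), which agrees
with the examples 88 95 and 82 80 8A 9C but is not stated outright in the
description.

```python
def utf4_encode(text: str) -> bytes:
    """Encode text per the quoted rules (a sketch, not a reference implementation)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F or 0xA0 <= cp <= 0xFF:
            # Rules (1) and (2): ASCII and Latin-1 (minus C1) stay one byte.
            out.append(cp)
            continue
        # Rule (3): split the code point into 4-bit nibbles.
        nibbles = []
        while True:
            nibbles.append(cp & 0xF)
            cp >>= 4
            if cp == 0:
                break
        nibbles.reverse()  # most significant nibble first
        # Assumption: minimal length, so no leading zero nibbles are emitted.
        # All nibbles but the last go in 8X bytes; the last goes in a 9X byte.
        out.extend(0x80 | n for n in nibbles[:-1])
        out.append(0x90 | nibbles[-1])
    return bytes(out)


print(utf4_encode("\u20ac").hex(" "))  # 82 80 8a 9c
```

Every code point that reaches rule (3) is at least U+0080, so it always
has two or more nibbles and therefore at least one 8X byte, as the
description requires.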

> Another interesting feature: starting from a random point in a string,
> it is easy to scan backwards or forwards to find the start-byte or
> end-byte of a character. This is valuable, as it means that you don't
> have to parse a string from the beginning in order not to get lost.

This feature works just as well as the similar feature in UTF-8. The
only difference is that here we have one or more lead bytes and only one
trail byte, while in UTF-8 it's the other way around.
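
The scanning can be sketched in a few lines of Python. The function
names char_start and char_end are hypothetical, and the sketch assumes
well-formed input (every run of 8X bytes is closed by exactly one 9X
byte); it makes no attempt at error recovery.

```python
def is_8x(b: int) -> bool:
    return 0x80 <= b <= 0x8F


def is_9x(b: int) -> bool:
    return 0x90 <= b <= 0x9F


def char_start(buf: bytes, i: int) -> int:
    """Index of the first byte of the character that contains buf[i]."""
    if not (is_8x(buf[i]) or is_9x(buf[i])):
        return i  # single-byte character
    # Walk left over lead bytes; a 9X byte to the left ends the previous character.
    while i > 0 and is_8x(buf[i - 1]):
        i -= 1
    return i


def char_end(buf: bytes, i: int) -> int:
    """Index of the last byte of the character that contains buf[i]."""
    # Walk right over lead bytes until the 9X trail byte (or a single byte).
    while is_8x(buf[i]) and i + 1 < len(buf):
        i += 1
    return i


buf = b"A\x82\x80\x8a\x9cB"   # "A", U+20AC, "B"
print(char_start(buf, 3), char_end(buf, 2))  # 1 4
```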

> What we'd really want the encoding name to say is "interpret as
> LATIN-1-PLUS if you can, otherwise interpret as LATIN-1", but there
> doesn't seem to be any way of saying that with current encoding
> nomenclature.

A fellow named Dan Oscarsson suggested something similar 5 years ago.
He was frustrated by the lack of transparency between UTF-8 and Latin-1,
and he proposed something he called "Adaptive UTF-8," in which any
invalid UTF-8 sequence would be interpreted as Latin-1. An "Adaptive
UTF-8" engine would thus be able to read his existing collection of
Latin-1 files as well as UTF-8 files.

One problem with this hybrid encoding was that, unlike plain UTF-8, it
would require lookahead to avoid encoding two or more Latin-1 characters
literally such that they would form a valid UTF-8 sequence. (I call
this the NESTLÉ® problem because the two characters U+00C9 U+00AE, if
encoded literally as Latin-1, would form the valid UTF-8 sequence for
U+026E LATIN SMALL LETTER LEZH.) The other problem is that this
"adaptive" encoding would *require* the special decoder, so existing
UTF-8 engines would choke on it.
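
The NESTLÉ® collision is easy to reproduce. The Python sketch below
first shows the raw byte coincidence, then a hypothetical
adaptive_decode function that takes a valid UTF-8 sequence where one
starts and otherwise treats the byte as Latin-1. The exact rules of the
original "Adaptive UTF-8" proposal are not given here, so this decoder is
an assumption made for illustration only.

```python
# The two Latin-1 bytes C9 AE happen to be the UTF-8 form of U+026E.
raw = "\u00c9\u00ae".encode("latin-1")
assert raw == b"\xc9\xae"
assert raw.decode("utf-8") == "\u026e"


def adaptive_decode(data: bytes) -> str:
    """Hypothetical decoder: valid UTF-8 where possible, else Latin-1 per byte."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Expected UTF-8 sequence length from the lead byte (0 = not a lead byte).
        if b < 0x80:
            n = 1
        elif 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) != n:
                raise ValueError("not a complete UTF-8 sequence")
            out.append(chunk.decode("utf-8"))  # strict: rejects malformed sequences
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(chr(b))  # fall back to Latin-1 for this one byte
            i += 1
    return "".join(out)


# Latin-1 "NESTLÉ®" is silently misread as "NESTL" followed by U+026E.
assert adaptive_decode("NESTL\u00c9\u00ae".encode("latin-1")) == "NESTL\u026e"
```

An encoder for such a scheme would have to look ahead and escape pairs
like this one, which is the lookahead cost described above.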

Browsers and e-mail programs are often able to "auto-detect" an unknown
encoding in the way Jill describes, but the better solution -- if
possible -- is to design encodings that fall back to existing encodings,
as UTF-8 falls back to ASCII, and Jill's UTF-4 falls back to Latin-1.
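
The fallback property can be checked with a small decoder sketch in
Python. The name utf4_decode is hypothetical, and the error handling
(raising ValueError on a stray 9X byte or an unterminated 8X run) is my
own assumption; the description above does not say what a decoder
should do with malformed input.

```python
def utf4_decode(data: bytes) -> str:
    """Decode bytes per the quoted rules (a sketch; error policy is assumed)."""
    out = []
    acc = None  # payload gathered from 8X bytes; None when outside a sequence
    for b in data:
        if 0x80 <= b <= 0x8F:
            # Lead byte: shift in four more payload bits.
            acc = ((acc or 0) << 4) | (b & 0xF)
        elif 0x90 <= b <= 0x9F:
            # Trail byte: completes the code point.
            if acc is None:
                raise ValueError("9X byte without a preceding 8X byte")
            out.append(chr((acc << 4) | (b & 0xF)))
            acc = None
        else:
            # ASCII or Latin-1 (non-C1) byte: the byte value is the code point.
            if acc is not None:
                raise ValueError("8X run not terminated by a 9X byte")
            out.append(chr(b))
    if acc is not None:
        raise ValueError("truncated sequence at end of input")
    return "".join(out)


# Text without C1 controls is byte-identical in Latin-1 and in this encoding,
# so either decoder gives the same result.
latin1_bytes = "Nestl\u00e9\u00ae caf\u00e9".encode("latin-1")
assert utf4_decode(latin1_bytes) == latin1_bytes.decode("latin-1")
```

The reverse direction is where the fallback degrades: a plain Latin-1
decoder given 82 80 8A 9C sees four C1 controls rather than a Euro sign.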

Anyone who is interested in the implementation, just let me know --
though it's not exactly complicated.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

