RE: Encoding for Fun (was Line Separator)

From: Jill Ramonsky (Jill.Ramonsky@aculab.com)
Date: Wed Oct 22 2003 - 08:42:09 CST


Well, that was considerably less wrath than I was expecting. Phew!

But to justify a few design decisions - yes, the encoding is longer (in
general) than UTF-8, but UTF-8 only attempts to preserve ASCII. I needed
to preserve ISO-8859-1. The reasons for this are complicated, but
basically I had to find a way to feed a Unicode string (originally an
array of 32-bit integers) into a legacy engine which was designed, many
years previously (by somebody else), to assume that everything in the
world was Latin-1. That legacy engine /did/ ascribe meaning to the
U+00A0 to U+00FF characters, so I couldn't use them for anything else. But
all I needed it to do with the non-Latin-1-Unicode characters was
preserve them. Essentially, I needed round-trip compatibility when
converting from Unicode to Latin-1 and back. This is of course
impossible ... but the C1 controls weren't being used, so I made it
possible.
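
To make the idea concrete, here is a minimal sketch in Python. The
actual byte layout is not given in this message, so everything about
the layout below is an assumption made purely for illustration: code
points that Latin-1 already gives meaning to (U+0000 to U+007F and
U+00A0 to U+00FF) pass through as single bytes, and every other code
point is written as a run of C1 bytes carrying four bits each, with
0x80-0x8F meaning "more follows" and 0x90-0x9F meaning "last byte".
The names encode and decode are hypothetical too.

```python
def is_passthrough(cp):
    """True for code points the Latin-1 engine must see unchanged."""
    return 0x00 <= cp <= 0x7F or 0xA0 <= cp <= 0xFF


def encode(code_points):
    """Encode an iterable of 32-bit code points into bytes.

    Hypothetical layout (an assumption, not the original spec):
    pass-through code points become one byte; anything else becomes
    big-endian nibbles, each stored in a C1 byte. 0x80-0x8F marks a
    nibble with more to follow, 0x90-0x9F marks the final nibble.
    """
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0xFFFFFFFF:
            raise ValueError("code point out of 32-bit range: %r" % cp)
        if is_passthrough(cp):
            out.append(cp)
            continue
        # Split into nibbles, least significant first, without
        # leading zeros, so the shortest sequence is always produced.
        nibbles = []
        while True:
            nibbles.append(cp & 0x0F)
            cp >>= 4
            if cp == 0:
                break
        nibbles.reverse()
        for n in nibbles[:-1]:
            out.append(0x80 | n)       # continuation byte
        out.append(0x90 | nibbles[-1])  # terminating byte
    return bytes(out)


def decode(data):
    """Inverse of encode(); does not check for overlong sequences."""
    result = []
    acc = 0
    in_sequence = False
    for b in data:
        if 0x80 <= b <= 0x8F:
            acc = (acc << 4) | (b & 0x0F)
            in_sequence = True
        elif 0x90 <= b <= 0x9F:
            result.append((acc << 4) | (b & 0x0F))
            acc = 0
            in_sequence = False
        else:
            if in_sequence:
                raise ValueError("C1 sequence interrupted")
            result.append(b)
    if in_sequence:
        raise ValueError("C1 sequence truncated")
    return result


text = [0x41, 0xE9, 0x20AC, 0x85]   # 'A', e-acute, euro sign, NEL
packed = encode(text)
assert decode(packed) == text       # round trip is lossless
```

Under these assumptions the Latin-1 bytes in the packed string are
exactly what the legacy engine expects, and the only bytes it has to
carry through untouched are C1 controls, which it never interprets.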

Security wasn't an issue, as the encoding never "leaked" into the
outside world, and its spec was never published. If I had wanted to use
it for interchange, I would obviously have further specified that all
characters be stored in the minimum number of bytes. My software didn't
check for violations of this, but only because it didn't need to.
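
As a rough illustration of what such a shortest-form check might look
like, the sketch below rejects overlong sequences while decoding. It
is not the original software. It assumes a made-up layout in which
U+0000 to U+007F and U+00A0 to U+00FF are single bytes and all other
code points are big-endian nibbles held in C1 bytes (0x80-0x8F for
"more follows", 0x90-0x9F for "last byte"). The name decode_strict is
hypothetical.

```python
def decode_strict(data):
    """Decode bytes to code points, rejecting non-minimal encodings.

    Assumed layout (hypothetical, for illustration only): bytes
    outside 0x80-0x9F stand for themselves; a run of 0x80-0x8F bytes
    ended by one 0x90-0x9F byte carries one code point, four bits
    per byte, most significant nibble first.
    """
    result = []
    acc = 0
    length = 0  # C1 bytes consumed so far in the current sequence
    for b in data:
        if 0x80 <= b <= 0x9F:
            nibble = b & 0x0F
            if length == 0 and b < 0x90 and nibble == 0:
                # A leading zero nibble adds a byte without adding
                # information, so a shorter form must exist.
                raise ValueError("overlong: leading zero nibble")
            acc = (acc << 4) | nibble
            length += 1
            if b >= 0x90:  # terminating byte of the sequence
                if acc <= 0x7F or 0xA0 <= acc <= 0xFF:
                    # This value has a one-byte form already.
                    raise ValueError("overlong: escaped Latin-1 value")
                result.append(acc)
                acc = 0
                length = 0
        else:
            if length:
                raise ValueError("C1 sequence interrupted")
            result.append(b)
    if length:
        raise ValueError("C1 sequence truncated")
    return result


# U+0085 has no one-byte form in this layout, so 0x88 0x95 is minimal.
assert decode_strict(bytes([0x41, 0x88, 0x95])) == [0x41, 0x85]
```

With a check like this, two different byte strings can never decode
to the same code points, which is the property that matters for
interchange.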

Jill


