From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 13:59:28 CST
> But if one has a 32-bit file, and wants it put up on the Internet, and
> be sure that endianness comes out right, I just noted that such a UTF-8
> extension could be used for that.
This is a *terrible* idea. It is from just such inappropriate extensions
of character encoding forms to represent non-character data that
character encoding messes derive. Putting up something that
masquerades as UTF-8 and is guaranteed to be misinterpreted as
UTF-8 when it is not, is just a recipe for *non*-interoperability and
trashed data.
> Most likely, people are developing other
> such byte-formats, for special use. This is probably not really of much
> concern to Unicode.
Actually, when it involves people suggesting inappropriate extensions
to UTF-8, it is a concern to everyone involved in processing UTF-8
data -- which is just about everybody.
> But if, for some unforeseen reason, one would want to go
> beyond the 21-bit limit,
Going beyond the 21-bit limit is either non-conformant, or it isn't a use
of characters in the standard.
Mixing characters and arbitrary binary stuff in the same numerical
space in binary datatypes is just bad software engineering.
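A minimal sketch of what the limit means in practice, assuming Python 3
and its built-in strict UTF-8 decoder. The helper name is_conformant_utf8
and the sample byte strings are hypothetical, chosen only for illustration.
It shows that a conformant decoder accepts U+10FFFF but rejects byte
sequences that follow the same bit pattern past that point:

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point; fits in 21 bits


def is_conformant_utf8(data: bytes) -> bool:
    """Return True if Python's strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+10FFFF, the last code point, encodes as F4 8F BF BF.
last_valid = chr(MAX_CODE_POINT).encode("utf-8")

# F4 90 80 80 follows the 4-byte bit pattern but would denote 0x110000.
just_past_limit = bytes([0xF4, 0x90, 0x80, 0x80])

# F8 88 80 80 80 uses a 5-byte lead byte, which is not valid in UTF-8.
five_byte_form = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

for label, sample in [
    ("U+10FFFF", last_valid),
    ("past 0x10FFFF", just_past_limit),
    ("5-byte form", five_byte_form),
]:
    print(label, sample.hex(" "), is_conformant_utf8(sample))
```

Any receiver that validates its input this way will refuse the extended
forms outright; one that does not validate may silently produce values
that are not characters at all.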
> it might be good to know what it should look like.
> And in my regular expression generator, I can do whatever I want,
Of course.
> once I go
> beyond the 21-bit limit -- I need only to make sure that the user of it
> finds it convenient.
... and that it doesn't leak out, (mis)labelled as UTF-8 (which it
will), where it will scare the horses in the street.
--Ken