Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 13:59:28 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe in reply to: Hans Aberg: "32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: 32'nd bit & UTF-8"

> But if one has a 32-bit file, and wants it put up on the Internet, and
> be sure that endianness comes out right, I just noted that such a UTF-8
> extension could be used for that.

This is a *terrible* idea. It is from just such inappropriate extensions
of character encoding forms to represent non-character data that
character encoding messes derive. Putting up something that
masquerades as UTF-8, and is guaranteed to be misinterpreted as
UTF-8 when it is not, is just a recipe for *non*-interoperability and
trashed data.
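
To make the interoperability point concrete, here is a minimal sketch in
Python of what a conformant receiver does with such data. The helper name
decodes_as_utf8 and the sample byte sequences are illustrative assumptions,
not anything proposed in this thread; the 5-byte sequence follows the older
RFC 2279 style bit layout for the value 0x200000.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Ordinary UTF-8 text is accepted.
ordinary = "é".encode("utf-8")

# Illustrative 5-byte form for 0x200000, just past the 21-bit range.
extended_five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# 4-byte form that would encode 0x110000, one past U+10FFFF.
past_last_code_point = bytes([0xF4, 0x90, 0x80, 0x80])

for sample in (ordinary, extended_five_byte, past_last_code_point):
    print(sample.hex(" "), decodes_as_utf8(sample))
```

A strict decoder rejects both extended samples outright. A lenient one may
substitute replacement characters or guess, which is where the trashed
data comes from.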

> Most likely, people are developing other
> such byte-formats, for special use. This is probably not really of much
> concern to Unicode.

Actually, when it involves people suggesting inappropriate extensions
to UTF-8, it is a concern to everyone involved in processing UTF-8
data -- which is just about everybody.

> But if, for some unforeseen reason, one would want to go
> beyond the 21-bit limit,

Going beyond the 21-bit limit is either non-conformant, or it isn't a use
of characters in the standard.
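
As a rough illustration of where that limit sits, the following Python
sketch checks whether an integer is a Unicode code point, and whether it is
a scalar value that UTF-8 may encode. The function names are assumptions
made for this example only. Note that the code space ends at U+10FFFF,
which needs 21 bits but does not fill them (21 bits would reach 0x1FFFFF).

```python
MAX_CODE_POINT = 0x10FFFF  # last code point in the Unicode code space


def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode code space 0..0x10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """True if value is a code point that is not a surrogate (D800..DFFF)."""
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)


# The largest code point needs exactly 21 bits.
print(MAX_CODE_POINT.bit_length())

# A full 32-bit value is simply not a character in the standard.
print(is_code_point(0x80000000))
```

Anything that fails is_code_point is arbitrary binary data as far as the
standard is concerned, whatever byte format it is wrapped in.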

Mixing characters and arbitrary binary stuff in the same numerical
space in binary datatypes is just bad software engineering.

> it might be good to know what it should look like.
> And in my regular expression generator, I can do whatever I want,

Of course.

> once I go
> beyond the 21-bit limit -- I need only to make sure that the user of it
> finds it convenient.

... and that it doesn't leak out, (mis)labelled as UTF-8 (which it
will), where it will scare the horses in the street.

--Ken


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 14:00:15 CST