Re: 32'nd bit & UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 19 2005 - 13:59:28 CST


    > But if one has a 32-bit file, and wants it put up on the Internet, and
    > be sure that endianness comes out right, I just noted that such a UTF-8
    > extension could be used for that.

    This is a *terrible* idea. It is from just such inappropriate extensions
    of character encoding forms to represent non-character data that
    character encoding messes derive. Putting up something that
    masquerades as UTF-8, and is guaranteed to be misinterpreted as
    UTF-8 when it is not, is just a recipe for *non*-interoperability and
    trashed data.

    > Most likely, people are developing other
    > such byte-formats, for special use. This is probably not really of much
    > concern to Unicode.

    Actually, when it involves people suggesting inappropriate extensions
    to UTF-8, it is a concern to everyone involved in processing UTF-8
    data -- which is just about everybody.

    > But if, for some unforeseen reason, one would want to go
    > beyond the 21-bit limit,

    Going beyond the 21-bit limit is either non-conformant, or it isn't
    a use of characters in the standard.

    Mixing characters and arbitrary binary stuff in the same numerical
    space in binary datatypes is just bad software engineering.
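
To make the 21-bit limit concrete, here is a minimal sketch of what a conformance check might look like. It assumes Python's built-in strict "utf-8" codec as the reference decoder; the helper names `is_unicode_scalar` and `is_conformant_utf8` are hypothetical, chosen only for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point; fits in 21 bits


def is_unicode_scalar(value: int) -> bool:
    """Return True if value is a Unicode scalar value.

    A scalar value is any code point in 0..0x10FFFF except the
    surrogate range 0xD800..0xDFFF.
    """
    in_range = 0 <= value <= MAX_CODE_POINT
    is_surrogate = 0xD800 <= value <= 0xDFFF
    return in_range and not is_surrogate


def is_conformant_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # F4 8F BF BF encodes U+10FFFF, the last valid code point.
    print(is_conformant_utf8(b"\xf4\x8f\xbf\xbf"))
    # F4 90 80 80 would encode 0x110000, one past the limit.
    print(is_conformant_utf8(b"\xf4\x90\x80\x80"))
    # F8 starts an obsolete 5-byte form, which is not UTF-8 at all.
    print(is_conformant_utf8(b"\xf8\x88\x80\x80\x80"))
```

A strict decoder of this kind rejects any "extended" sequence outright, which is why data using such an extension cannot safely travel under the UTF-8 label.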

    > it might be good to know what it should look like.
    > And in my regular expression generator, I can do whatever I want,

    Of course.

    > once I go
    > beyond the 21-bit limit -- I need only to make sure that the user of it
    > finds it convenient.

    ... and that it doesn't leak out, (mis)labelled as UTF-8 (which it
    will), where it will scare the horses in the street.

    --Ken


