Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 17:51:20 CST

    On 2005/01/19 20:59, Kenneth Whistler at kenw@sybase.com wrote:

    >> But if one has a 32-bit file, and wants it put up on the Internet, and
    >> be sure that endianness comes out right, I just noted that such a UTF-8
    >> extension could be used for that.
    >
    > This is a *terrible* idea. It is from just such inappropriate extensions
    > of character encoding forms to represent non-character data that
    > character encoding messes derive from. Putting up something that
    > masquerades as UTF-8 and is guaranteed to be misinterpreted as
    > UTF-8 when it is not, is just a recipe for *non*-interoperability and
    > trashed data.

    The idea was not that it should masquerade as UTF-8. In fact, there appears
    to be a mess deriving from the fact that the name UTF-8 has been used for a
    number of different transformations. If one had proceeded as I suggested,
    this mess would never have occurred.

    >> Most likely, people are developing other
    >> such byte-formats, for special use. This is probably not really of much
    >> concern to Unicode.
    >
    > Actually, when it involves people suggesting inappropriate extensions
    > to UTF-8, it is a concern to everyone involved in processing UTF-8
    > data -- which is just about everybody.

    Well, from the point of view of Unicode, it would not be a character range,
    but merely a transformation of numbers into bytes. I find it hard to see how
    Unicode can prevent such a thing by pretending it does not exist.

    >> But if, for some unforeseen reason, one would want to go
    >> beyond the 21-bit limit,
    >
    > Going beyond the 21-bit limit is non-conformant, or it isn't use
    > of characters in the standard.

    We all know that it is. But this does not mean that folks will not go beyond
    it. This happened with ASCII, and so it may happen with Unicode.

    > Mixing characters and arbitrary binary stuff in the same numerical
    > space in binary datatypes is just bad software engineering.

    Actually, I thought one might use it as just an available format. But some
    folks may use it in connection with machine-generated grammars. They will do
    it regardless of what Unicode says about it.

    >> it might be good to know what it should look like.
    >> And in my regular expression generator, I can do whatever I want,
    >
    > Of course.
    >
    >> once I go
    >> beyond the 21-bit limit -- I need only to make sure that the user of it
    >> finds it convenient.
    >
    > ... and that it doesn't leak out, (mis)labelled as UTF-8 (which it
    > will), where it will scare the horses in the street.

    So what do you suggest I should call it? CPBTF-8? :-)

    And it is already mislabelled as UTF-8, due to its heritage in RFCs. Also
    see, for example, <http://www.cl.cam.ac.uk/~mgk25/unicode.html>. So Unicode
    has already failed in this respect.
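
    To make the RFC heritage concrete: RFC 2279 described UTF-8 with sequences
    of up to six bytes, covering values below 2^31, whereas the later RFC 3629
    restricts it to four bytes and U+10FFFF. The following is a minimal sketch
    of that older byte pattern, not of any 32-bit extension discussed in this
    thread. The function name encode_rfc2279 is hypothetical and chosen only
    for illustration.

```python
def encode_rfc2279(n: int) -> bytes:
    """Encode an integer below 2**31 with the RFC 2279 byte pattern.

    Illustrative sketch only: it applies no Unicode restrictions
    (surrogates and values above 0x10FFFF are encoded like any other number).
    """
    if not 0 <= n < 0x80000000:
        raise ValueError("value outside the 31-bit range")
    if n < 0x80:
        return bytes([n])  # single byte, identical to ASCII
    # (exclusive upper limit, sequence length, lead-byte marker)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),      # the 21-bit limit ends here
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if n < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (n & 0x3F)
        n >>= 6
    out[0] = lead | n  # remaining high bits go into the lead byte
    return bytes(out)


# Within the Unicode range the result agrees with ordinary UTF-8.
print(encode_rfc2279(0x20AC).hex())      # e282ac
# Beyond 21 bits the same pattern simply continues with longer sequences.
print(encode_rfc2279(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

    Note that even this older form stops at 31 bits, so a 32nd bit needs a
    pattern that neither RFC defines.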

    Otherwise it is the writer of the lexer that provides the correct labelling,
    not the Flex developer.

      Hans Aberg


