Re: 32nd bit & UTF-8

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Tue Jan 18 2005 - 06:00:33 CST


    On Monday, January 17th, 2005, at 18:06Z, Hans Aberg wrote:

    > Are there any good reasons for UTF-[8] to exclude the 32nd bit of
    > an encoded 4-byte value?

    The ISO/IEC 10646 framework: UCS-4 code positions are 31-bit values
    (the high bit is always zero), so there is no 32nd bit to encode.

    > I.e., the 6-byte combinations
    > 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > where the first x = 1.

    > With a full 32-bit encoding, one can also use UTF-8 to encode
    > binary data.

    Why?
    Look, I have two computers.
    One mostly runs DOS software written in Turbo Pascal, where dealing
    with 32-bit unsigned data is a nightmare (there is no built-in type
    for it): I have to drop down to assembly for every operation on such
    "binary data", or else use the 64-bit signed type through the FPU,
    with a noticeable performance hit. On that machine I very much
    prefer "16-bit binary data" ;-).
    Of course, in the real world I use streams (including counted
    strings) of 8-bit data, like everybody else.

    The other has a 64-bit architecture. I have trouble reconciling your
    proposition above (about "full" 32 bits) with it. In fact, I am
    already entangled with software that was designed as a "unified
    architecture" yet foresaw only the use of 32-bit integers and
    pointers.
    So I beg your pardon, but I feel a bit angry about your proposal.
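
    (For concreteness, here is a minimal C sketch, purely illustrative
    and of course not conformant UTF-8, of what the proposed 6-byte form
    would mean: lead bytes FC..FF, where FE/FF are the new ones carrying
    the 32nd bit.)

        #include <stdint.h>

        /* Hypothetical (non-conformant!) 6-byte form
         * 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx,
         * giving 2 + 5*6 = 32 value bits. Lead bytes FE/FF ("first
         * x = 1") would carry values with the 32nd bit set, which
         * UCS-4 and hence UTF-8 exclude. */
        static void encode32_hypothetical(uint32_t v, uint8_t out[6])
        {
            out[0] = 0xFC | (uint8_t)(v >> 30);          /* bits 31..30 */
            out[1] = 0x80 | (uint8_t)((v >> 24) & 0x3F); /* bits 29..24 */
            out[2] = 0x80 | (uint8_t)((v >> 18) & 0x3F);
            out[3] = 0x80 | (uint8_t)((v >> 12) & 0x3F);
            out[4] = 0x80 | (uint8_t)((v >>  6) & 0x3F);
            out[5] = 0x80 | (uint8_t)( v        & 0x3F); /* bits 5..0 */
        }

    (Note that for any value below 2^26 this 6-byte form merely
    duplicates a shorter encoding in non-shortest form, which is the
    ambiguity I come back to below.)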

    > It also simplifies somewhat the implementation of
    > Unicode in lexer generators (such as Flex): The leading byte then
    > covers all 256 combinations. All 2^32 numbers should probably be
    > there for generating proper lexer error messages.

    I am not sure I understand you correctly. What about 00 vs. C0.80,
    E0.80.80, FE.80.80.80.80.80.80, etc.? If the leading byte may take
    all 256 values, every one of those non-shortest ("overlong") forms
    decodes to the same value as plain 00.
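
    (To make that concrete, a small C sketch, again hypothetical code
    and not any real decoder: a naive decoder that derives the length
    from the lead byte but never rejects non-shortest forms.)

        #include <stdint.h>
        #include <stdio.h>

        /* Naive decoder: no overlong check, no bounds check (it assumes
         * the buffer holds a complete sequence). It maps 00, C0.80,
         * E0.80.80, ... FE.80.80.80.80.80.80 all to the same value 0. */
        static uint32_t decode_naive(const uint8_t *s)
        {
            uint8_t b = s[0];
            uint32_t v;
            int k = 1, i;

            if (b < 0xC0)                 /* ASCII (or a stray trail byte) */
                return b;
            while (b & 0x40) {            /* k = total sequence length,    */
                k++;                      /* read off the leading 1 bits   */
                b = (uint8_t)(b << 1);
            }
            v = s[0] & (0xFF >> (k + 1)); /* value bits of the lead byte   */
            for (i = 1; i < k; i++)
                v = (v << 6) | (s[i] & 0x3F);
            return v;
        }

        int main(void)
        {
            const uint8_t a[] = { 0x00 };
            const uint8_t b[] = { 0xC0, 0x80 };
            const uint8_t c[] = { 0xE0, 0x80, 0x80 };
            const uint8_t d[] = { 0xFE, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };

            /* Prints "0 0 0 0": four different byte strings, one decoded
             * value. That is exactly the ambiguity a byte-driven lexer
             * (or its error messages) would have to cope with. */
            printf("%lu %lu %lu %lu\n",
                   (unsigned long)decode_naive(a),
                   (unsigned long)decode_naive(b),
                   (unsigned long)decode_naive(c),
                   (unsigned long)decode_naive(d));
            return 0;
        }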

    Antoine


