From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Tue Jan 18 2005 - 06:00:33 CST
On Monday, January 17th, 2005 18:06Z Hans Aberg wrote:
> Are there any good reasons for UTF-8 to exclude the 32nd bit of
> an encoded 4-byte value?
The ISO/IEC 10646 framework.
> I.e, the 6-byte combinations
> 111111xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> where the first x = 1.
> With a full 32-bit encoding, one could also use UTF-8 to encode
> binary data.
Look, I have two computers.
One generally runs DOS software written in Turbo Pascal, where dealing
with 32-bit unsigned data is a nightmare (there is no built-in data type): I
have to drop down to assembly for every operation on such "binary data", or
else use the 64-bit signed type through the FPU, with a noticeable
performance hit. I would very much prefer "16-bit binary data" on it ;-).
Of course, in the real world I use streams (including counted strings)
of 8-bit data, like everybody else.
The other has a 64-bit architecture, and I have trouble squaring your
proposition above (the "full" part) with it. In fact, I am already entangled
with software that was designed as a "unified architecture" yet only
anticipated the use of 32-bit integers and pointers.
So I beg your pardon, but I feel a bit angry about your proposal.
> It also simplifies somewhat the implementation of
> Unicode in lexer generators (such as Flex): The leading byte then
> covers all 256 combinations. All 2^32 numbers should probably be
> there for generating proper lexer error messages.
Not sure if I understand you correctly. What about 00 vs. C0.80, E0.80.80,
F0.80.80.80, etc.?
This archive was generated by hypermail 2.1.5 : Tue Jan 18 2005 - 06:05:18 CST