Re: UTF-8, U+0000 and Software Development (was: Re: New UTF-8 decoder stress test file)

From: Paul Keinanen (keinanen@sci.fi)
Date: Mon Sep 27 1999 - 02:31:09 EDT


On Sun, 26 Sep 1999 14:13:12 -0700 (PDT), Karl Pentzlin wrote:

>From: Paul Dempsey (Exchange) <paulde@Exchange.Microsoft.com>
>To: 'Karl Pentzlin' <karl-pentzlin@acssoft.de>; Unicode List

>
>> Using UTF-8 to represent a 0 byte without 0-valued bytes is misusing UTF-8
>> (at least for text interchange).
>>
>> ...
>>
>> I've written quite a lot of text-processing code in C/C++ that handles
>> embedded NUL characters. There's nothing intrinsic to the language that
>> makes it especially difficult. I just don't use much of the standard ISO C
>> library.

>That is (somewhat exaggerated): to conform to one standard (UTF-8 encoding
>U+0000 strictly by a 0 byte), you decide against another standard (the ISO
>standard C libraries) - a standard which was also made for interchange,
>namely for program source interchange between different operating systems.
>
>The other point is, C/C++/Delphi/Java programmers *will* use the 0xC0 0x80
>encoding of U+0000 regardless of whether it strictly conforms to the
>standard or not, as it makes life much easier for them (and their work
>cheaper for their bosses). Therefore, you *will* find 0xC0 0x80 in text
>interchange files, whether the standard allows it or not, and therefore
>real-world applications *will* treat this encoding correctly (especially
>as the deviation from the written standard is so small, is transparent
>to all users, and U+0000 is not especially frequent in real text anyway
>- thus the cost/advantage ratio will in no case economically justify
>strict standard conformance).
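
For concreteness: 0xC0 0x80 is the "overlong" two-byte form of U+0000,
and UTF-8 as defined requires the shortest possible encoding, so a
strictly conforming decoder has to reject that pair. A minimal sketch
(mine, not taken from any particular decoder) of the check for the
two-byte case:

    #include <stdio.h>

    /* Decode one two-byte UTF-8 sequence.  Returns the code point, or
     * -1 if the pair is not a legal shortest-form two-byte sequence. */
    static long decode_two_byte(unsigned char b1, unsigned char b2)
    {
        long cp;

        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
            return -1;              /* not a two-byte sequence        */
        cp = ((long) (b1 & 0x1F) << 6) | (b2 & 0x3F);
        if (cp < 0x80)
            return -1;              /* overlong, e.g. 0xC0 0x80       */
        return cp;
    }

    int main(void)
    {
        printf("%ld\n", decode_two_byte(0xC2, 0xA9)); /* 169 = U+00A9 */
        printf("%ld\n", decode_two_byte(0xC0, 0x80)); /* -1, overlong */
        return 0;
    }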

The null-terminated character string is a nice way of storing
_constant_ _plain_text_ strings; it has been in use at least since the
early PDP-11 assemblers at the beginning of the 1970s and was later
adopted into the C language for processing _plain_text_ strings (with
the various strxxx/wcsxxx functions). What I do not understand is why
on earth programmers insist on using C-string style operations for
non-plain-text data.

Non-plain-text operations have always been carried out in C with
functions like memcpy (instead of strcpy), which work correctly
regardless of any embedded nulls. This of course requires that the
data size is known, but it should not be any harder to maintain a
proper descriptor for the data (containing the address, the current
size and, for variable-length data, the allocated size) and to use
counted strings when storing to disk. This can be made completely
transparent in object-oriented languages, where the implementation
details can be hidden within a class.
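
As a rough illustration only (the type and function names below are my
own invention, not taken from any existing library), such a descriptor
might look like this in C:

    #include <stdlib.h>
    #include <string.h>

    /* Counted byte string: address, current length and allocated
     * capacity.  Everything is moved with memcpy, so embedded 0x00
     * bytes are handled like any other byte.  When storing to disk,
     * write the length first and then the raw bytes. */
    typedef struct {
        unsigned char *data;     /* address of the bytes           */
        size_t length;           /* number of bytes currently used */
        size_t capacity;         /* number of bytes allocated      */
    } counted_string;

    /* Append n arbitrary bytes, growing the buffer when necessary.
     * Returns 0 on success, -1 on allocation failure. */
    static int cs_append(counted_string *s, const void *bytes, size_t n)
    {
        if (s->length + n > s->capacity) {
            size_t new_cap = s->capacity ? s->capacity * 2 : 16;
            unsigned char *p;

            while (new_cap < s->length + n)
                new_cap *= 2;
            p = (unsigned char *) realloc(s->data, new_cap);
            if (p == NULL)
                return -1;
            s->data = p;
            s->capacity = new_cap;
        }
        memcpy(s->data + s->length, bytes, n);
        s->length += n;
        return 0;
    }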

So I still do not understand why anyone would insist on using wide
null-terminated C-strings in Java for non-plain-text data (and on using
a UTF-8 look-alike multibyte encoding instead of UCS-2 for internal
permanent storage, but that is another question).

Paul Keinänen
 


