UTF-8, U+0000 and Software Development (was: Re: New UTF-8 decoder stress test file)

From: Karl Pentzlin (karl-pentzlin@acssoft.de)
Date: Sun Sep 26 1999 - 14:16:13 EDT


Software developers (especially those using C, C++, Delphi and similar
languages) have to deal with byte sequences that must not contain any byte
of value 0, because 0 marks the end of the byte sequence.
In any 8-bit code (where every character is "encoded" by its own byte
value), a string could never contain a character of value 0, as long as you
confined yourself to the standard library string functions. This can change
when you encode your character sequences in UTF-8, provided you are allowed
to encode U+0000 as 0xC0 0x80 (i.e. 11000000 10000000). If UTF-8 (for good
reasons outside of software development concerns) disallows 11000000
10000000, programmers again face the problem of being able to encode every
value except U+0000 within strings, although UTF-8 could solve this problem.
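
To make the problem concrete, here is a minimal C sketch. The array names
and the helper visible_length() are made up for this illustration; only
strlen() is a real library function. It shows how a standard string function
sees the text "A", U+0000, "B" in both byte forms.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Strict UTF-8 for "A", U+0000, "B", followed by the C terminator.
   The byte for U+0000 is indistinguishable from the terminator. */
static const unsigned char strict_form[] = { 0x41, 0x00, 0x42, 0x00 };

/* The same text with U+0000 written as the two-byte form 0xC0 0x80.
   Only the final byte is 0. */
static const unsigned char overlong_form[] = { 0x41, 0xC0, 0x80, 0x42, 0x00 };

/* Hypothetical helper: the number of bytes that the standard library
   string functions consider to belong to the string. */
size_t visible_length(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

With these definitions, visible_length(strict_form) is 1, because the string
appears to end right after "A", while visible_length(overlong_form) is 4, so
all of the text survives.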

There are two possible solutions:

1. To allow two "conformance levels" of UTF-8:
a. "strict": U+0000 has to be encoded as 00000000
b. "special" (or whatever name is chosen): U+0000 may (or even may only) be
encoded as 11000000 10000000

2. To regard UTF-8(-like) sequences which may contain 11000000 10000000 but
not 00000000 as a "meta-encoding", i.e. the UTF-8 sequence is itself encoded:
00000000 is encoded as 11000000 10000000, which can be decoded unambiguously
to 00000000, because 11000000 10000000 is not a valid UTF-8 sequence and
therefore cannot stand for itself, as all other bytes and byte sequences do.
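
The "meta-encoding" of solution 2 could be sketched in C roughly as follows.
This is only an illustration under stated assumptions: the function names
meta_encode() and meta_decode() are invented here, the input to the encoder
is assumed to be strict UTF-8, and the caller is assumed to supply buffers
that are large enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoder: copies in_len bytes of strict UTF-8, replacing every
   0x00 byte with 0xC0 0x80, then appends a terminating 0x00.
   out must have room for 2 * in_len + 1 bytes.
   Returns the number of bytes written, not counting the terminator. */
size_t meta_encode(const unsigned char *in, size_t in_len, unsigned char *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == 0x00) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = in[i];
        }
    }
    out[n] = 0x00;
    return n;
}

/* Hypothetical decoder: reads a 0x00-terminated meta-encoded string and
   turns every 0xC0 0x80 pair back into a single 0x00 byte.
   out must have room for as many bytes as the input has before its
   terminator. Returns the number of decoded bytes. */
size_t meta_decode(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (*in != 0x00) {
        /* Reading in[1] is safe: in[0] is not the terminator here. */
        if (in[0] == 0xC0 && in[1] == 0x80) {
            out[n++] = 0x00;
            in += 2;
        } else {
            out[n++] = *in++;
        }
    }
    return n;
}
```

The round trip is unambiguous for the reason given above: 0xC0 0x80 never
occurs in strict UTF-8, so the decoder cannot mistake ordinary text for an
escaped U+0000. The decoded result is no longer a C string, which is why
meta_decode() returns an explicit length.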

In my opinion, this discussion should be continued together with the
standards bodies responsible for the programming languages.

Regards
Karl Pentzlin
AC&S Analysis Consulting & Software GmbH
Ganghoferstraße 128
D-81373 München, Germany

-----Original Message-----
From: Valeriy E. Ushakov <uwe@ptc.spbu.ru>
To: Unicode List <unicode@unicode.org>
Cc: Unicode List <unicode@unicode.org>; <linux-utf8@humbolt.geo.uu.nl>
Sent: Sunday, 26 September 1999 19:11
Subject: Re: New UTF-8 decoder stress test file

> On Sun, Sep 26, 1999 at 09:22:26AM -0700, Markus Kuhn wrote:
>
> > 4.3 Overlong representation of the NUL character
> >
> > The following five sequences should also be rejected like malformed
> > UTF-8 sequences and should not be treated like the ASCII NUL
> > character.
> >
> > 4.3.1 U+0000 = c0 80 = "?"
>
> I believe that's exactly what JDK uses to encode U+0000 in utf-8
> encoded NUL terminated C strings to distinguish U+0000 which is part
> of a string from the terminating NUL. I can't find the reference,
> though.
>
> SY, Uwe


