Re: Corrigendum #9

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 2 Jul 2014 21:19:16 +0200

2014-07-02 20:19 GMT+02:00 David Starner <prosfilaes_at_gmail.com>:

> I might argue 11111111b for 0x00 in UTF-8 would be technically
> legal

It is not. UTF-8 specifies the effective value of each 8-bit byte; if you
store 11111111b in that byte you get exactly the same result as when
storing 0xFF or -1 (unless your system uses "bytes" larger than 8 bits, but
the era of PDP mainframes with non-8-bit bytes is long over: all devices
around us expose 8-bit byte values on their interfaces, even if they may
internally encode the exposed bits with longer sequences, such as MFM
encodings, extra control and clock/sync bits, or rotating three-state
sequences with automatic synchronization from the negative or positive
transitions at every encoded bit position, plus some rule-breaking patterns
on certain bits to mark the start of packets).

> the standard never specifies which bit sequences correspond to
> which byte values--but \xC0\x80 would probably be more reliably
> processed by existing code.

But the same C libraries also use -1 as their end-of-stream value, and once
that value is narrowed to a byte it becomes indistinguishable from the NUL
character that could be stored anywhere in the stream.
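
The same ambiguity is easy to demonstrate on the Java side too; here is a
minimal illustrative sketch (class and variable names are mine): once 0xFF
is narrowed to a byte it equals -1, which is precisely why
InputStream.read() returns an int and reserves -1 for end of stream.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class EofVsByte {
        public static void main(String[] args) throws IOException {
            // A byte holding 11111111b is the same value as -1 once narrowed.
            byte b = (byte) 0xFF;
            System.out.println(b == -1);   // true

            // read() returns an int: 0..255 for data bytes, -1 only for
            // end of stream, so a 0xFF data byte stays distinguishable.
            InputStream in = new ByteArrayInputStream(new byte[] { (byte) 0xFF });
            System.out.println(in.read()); // 255 (the data byte)
            System.out.println(in.read()); // -1  (end of stream)
        }
    }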

The main reason why 0xC0,0x80 was chosen instead of 0x00 is historic: Java's
JNI interface originally passed strings only as 8-bit sequences, without a
separate parameter to specify the length of the encoded sequence. 0x00 was
therefore used as a terminator, just as in the basic ANSI C string library
(string.h and stdio.h), and Java was ported to heterogeneous systems
(including small devices whose "int" type was only 8 bits wide, blocking the
use of BOTH 0x00 and 0xFF in some system I/O APIs).
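
That historical choice is still observable today through
DataOutputStream.writeUTF, which emits Java's modified UTF-8; a minimal
sketch (the class name is mine):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            // writeUTF uses "modified UTF-8": U+0000 is written as C0 80.
            new DataOutputStream(buf).writeUTF("A\u0000B");
            for (byte b : buf.toByteArray()) {
                System.out.printf("%02X ", b & 0xFF);
            }
            // Prints: 00 04 41 C0 80 42
            // (u2 length prefix, 'A', C0 80 for U+0000, 'B' -- no 0x00 byte
            //  appears inside the encoded character data itself)
        }
    }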

At least 0xC0,0x80 was safe (and not otherwise used by UTF-8; at that time
UTF-8 was not yet a precisely defined standard, and it was still legal to
represent U+0000 as 0xC0,0x80. The prohibition of overlong sequences in
UTF-8 or Unicode came many years later; Java used the early,
informative-only RFC specification, which was also supported by ISO, before
ISO/IEC 10646-1 and Unicode 1.1 were aligned).

Both Unicode and ISO/IEC 10646 have changed since then (each in an
incompatible way), as it was necessary to make the two standards compatible
with each other. Java could not change its ABI for JNI; it was too late.

However, Java later added another, UTF-16-based string interface to JNI.
That interface still does not enforce the UTF-16 rules about paired
surrogates (just like C, C++ or even JavaScript; see the sketch below), but
it has a separate field for storing the encoded string length (in 16-bit
code units), so it can use the standard 0x0000 value for U+0000. As much as
possible, JNI extension libraries should use that 16-bit interface (which is
also simpler to handle with modern Unicode-compatible OS APIs, notably on
Windows). But the 8-bit JNI interface is still commonly used in JNI
extension libraries for Unix/Linux (because it is safer to handle the
conversion from 16-bit to 8-bit in the JVM than in an external JNI library
that uses its own memory allocation and cannot use the garbage collector of
the JVM's managed memory).
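
To illustrate the surrogate point at the Java level (a minimal sketch of
mine; JNI's GetStringChars simply exposes these same 16-bit code units to
native code):

    public class UnpairedSurrogate {
        public static void main(String[] args) {
            // A lone high surrogate is a perfectly legal Java String:
            String s = "\uD800" + "A";
            System.out.println(s.length());        // 2 (UTF-16 code units)
            System.out.println((int) s.charAt(0)); // 55296 == 0xD800, kept as-is

            // Converting to real UTF-8, the encoder flags the lone surrogate
            // as malformed and substitutes '?':
            byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
            for (byte b : utf8) System.out.printf("%02X ", b & 0xFF);
            // Prints: 3F 41
        }
    }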

The Java modified UTF-8 encoding is still used in the binary encoding of
compiled class files (this is invisible to applications that only see
16-bit encoded strings, unless they have to parse or generate compiled
class files).
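
For those who do parse or generate class files: the CONSTANT_Utf8 entries in
the constant pool use the same u2-length-plus-modified-UTF-8 layout that
DataOutputStream.writeUTF produces and DataInputStream.readUTF consumes; a
small round-trip sketch (the class name is mine):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ClassFileUtf8RoundTrip {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            // Same byte layout as a CONSTANT_Utf8_info entry (minus its tag byte).
            new DataOutputStream(buf).writeUTF("A\u0000B");
            String back = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())).readUTF();
            System.out.println(back.equals("A\u0000B")); // true: U+0000 survives
        }
    }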

_______________________________________________
Unicode mailing list
Unicode_at_unicode.org
http://unicode.org/mailman/listinfo/unicode