Re: Surrogates and noncharacters from Philippe Verdy on 2015-05-11 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 11 May 2015 21:25:29 +0200

Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit)
code unit in "32-bit strings", even if it is not a valid code point with a
valid scalar value in any legacy or standard version of UTF-32.

The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned
differences in 32-bit integers (in case they were ever converted to larger
integers, such as 64-bit ones, which would expose those differences in APIs
returning individual code units).

It's true that in 32-bit integers (signed or unsigned) you cannot
differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in
C/C++ standard libraries for representing the EOF condition returned by
functions or macros like getchar()). But the EOF condition does not need to
be differentiated when you are scanning positions in a buffer of 32-bit
integers (instead you compare the relative index in the buffer with the
buffer length, or the buffer object includes a separate method to test this
condition).
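
To illustrate the point above, here is a minimal sketch in C. The names
next_unit_sentinel, count_until_sentinel, count_matching and END_OF_UNITS
are hypothetical, chosen only for this example; they do not come from any
standard library. The sentinel-style reader mistakes the code unit
0xFFFFFFFF for the end of the data, while the index-based scan does not
need an in-band end marker at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-band end marker: -1 converted to 32 bits is 0xFFFFFFFF. */
#define END_OF_UNITS ((uint32_t)-1)

/* getchar()-style reader: returns the next 32-bit code unit, or
   END_OF_UNITS when the buffer is exhausted. A stored 0xFFFFFFFF is
   indistinguishable from the end marker. */
uint32_t next_unit_sentinel(const uint32_t *buf, size_t len, size_t *pos)
{
    if (*pos >= len)
        return END_OF_UNITS;
    return buf[(*pos)++];
}

/* Counts units until the reader reports the end marker. Stops early if
   the buffer itself contains 0xFFFFFFFF. */
size_t count_until_sentinel(const uint32_t *buf, size_t len)
{
    size_t pos = 0, n = 0;
    while (next_unit_sentinel(buf, len, &pos) != END_OF_UNITS)
        n++;
    return n;
}

/* Index-based scan: the end is detected by comparing the index with the
   buffer length, so every 32-bit value is an ordinary code unit. */
size_t count_matching(const uint32_t *buf, size_t len, uint32_t wanted)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == wanted)
            n++;
    return n;
}
```

With the three units 0x41, 0xFFFFFFFF, 0x42 in a buffer, the sentinel-based
count stops after the first unit, whereas the index-based scan visits all
three and finds the single 0xFFFFFFFF.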

But today, as programming environments are moving to 64-bit by default,
the APIs that return an integer when reading individual code positions will
return them as 64-bit integers, even if the inner storage uses 32-bit code
units: 0xFFFFFFFF will then be returned as a positive integer and not as the
-1 used for EOF.
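
As a rough sketch of that idea in C (next_unit_wide and END_OF_UNITS64 are
hypothetical names made up for this example, not part of any existing API):
the storage stays 32-bit, but the accessor returns a 64-bit signed integer,
so every stored code unit comes back in the range 0..0xFFFFFFFF and -1
remains free for the end condition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical out-of-band end marker for a 64-bit return type. */
#define END_OF_UNITS64 ((int64_t)-1)

/* Returns the next 32-bit code unit widened to 64 bits, or
   END_OF_UNITS64 at the end of the buffer. Converting uint32_t to
   int64_t preserves the value, so 0xFFFFFFFF becomes 4294967295. */
int64_t next_unit_wide(const uint32_t *buf, size_t len, size_t *pos)
{
    if (*pos >= len)
        return END_OF_UNITS64;
    return (int64_t)buf[(*pos)++];
}
```

Reading a buffer that holds only 0xFFFFFFFF first yields the positive value
4294967295, and only the following call yields -1.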

This was not yet true when the legacy UTF-32 encoding was created, when
a majority of environments were still only running 32-bit or 16-bit code;
for the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had
to be assigned to a non-character to limit the risk of confusion with the
EOF condition in C/C++ or similar APIs in other languages (when they cannot
throw an exception instead of returning a distinct EOF value).
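
The 16-bit collision can be sketched in C as follows. The function names
are hypothetical; the noncharacter ranges (U+FDD0..U+FDEF, plus the last
two code points of each of the 17 planes) are those defined by the Unicode
Standard. This only shows that the code unit 0xFFFF has the same 16-bit
pattern as -1 and that U+FFFF is a noncharacter; it says nothing about why
the assignment was made.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a 16-bit code unit has the same bit pattern as -1 truncated
   to 16 bits, i.e. the value a 16-bit EOF of -1 would take. */
bool collides_with_eof16(uint16_t unit)
{
    return unit == (uint16_t)-1;
}

/* True if cp is one of the 66 Unicode noncharacters. */
bool is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFFu)
        return false;                    /* not a code point at all */
    if (cp >= 0xFDD0u && cp <= 0xFDEFu)
        return true;                     /* the contiguous block of 32 */
    return (cp & 0xFFFEu) == 0xFFFEu;    /* U+xxFFFE and U+xxFFFF */
}
```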

Well, there are still a lot of devices running 32-bit code (notably in
guest VMs, and in small devices), written in C/C++ with the old standard
C library but without OOP features (such as exceptions, or methods for
buffering objects). In Java, the "int" datatype (which is 32-bit and
signed) has not been extended to 64-bit, even on platforms where 64-bit
integers are the internal datatype used by the JVM in its natively compiled
binary code.

Once again, "code units" and "x-bit strings" are not bound to any Unicode
or ISO/IEC 10646 or legacy RFC constraints related to the current standard
UTFs or legacy (obsoleted) UTFs.

And I still don't see any productive need for "Unicode x-bit strings" in
TUS D80-D83, when all that is needed for conformance is NOT the whole
range of valid code units, but only the allowed range of scalar values
(for which code units only need to be defined in a large enough set of
distinct values).

The exact cardinality of this set does not matter, and there can always
exist additional valid "code units" not bound to any valid "scalar value"
or to the minimal set of distinct "Unicode code units" needed to support the
standard Unicode encoding forms.

Even the Unicode scalar values or the implied values of "Unicode code
units" do not have to be aligned with the effective native values of "code
units" used at the lower level... except for the standard encoding schemes
for 8-bit interchange, where byte order matters... but still not the lower
level bit order and the native hardware representation of individually
addressable bytes, which may sometimes be larger than 8 bits, with some other
control bits or framing bits, and sometimes even with variable bit sizes
depending on their relative position in transport frames!
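
A small C sketch of where byte order does matter, namely when a 32-bit code
unit is serialized for 8-bit interchange. The function names are
hypothetical. Shifts and masks are used on purpose so that the result does
not depend on how the host stores the integer natively.

```c
#include <stdint.h>

/* Writes one 32-bit code unit as 4 bytes, most significant byte first
   (the byte order used by the UTF-32BE encoding scheme). */
void put_unit32_be(uint32_t unit, unsigned char out[4])
{
    out[0] = (unsigned char)((unit >> 24) & 0xFFu);
    out[1] = (unsigned char)((unit >> 16) & 0xFFu);
    out[2] = (unsigned char)((unit >> 8) & 0xFFu);
    out[3] = (unsigned char)(unit & 0xFFu);
}

/* Same code unit, least significant byte first (UTF-32LE byte order). */
void put_unit32_le(uint32_t unit, unsigned char out[4])
{
    out[0] = (unsigned char)(unit & 0xFFu);
    out[1] = (unsigned char)((unit >> 8) & 0xFFu);
    out[2] = (unsigned char)((unit >> 16) & 0xFFu);
    out[3] = (unsigned char)((unit >> 24) & 0xFFu);
}

/* Reassembles a code unit from 4 big-endian bytes. */
uint32_t get_unit32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

These functions only fix the order of the bytes; the order of the bits
within each byte on the wire is left to the lower layers.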

2015-05-11 19:44 GMT+02:00 Doug Ewell <doug_at_ewellic.org>:

> Hans Aberg <haberg dash 1 at telia dot com> wrote:
>
> >>> However I wonder what would be the effect of D80 in UTF-32: is
> >>> <0xFFFFFFFF> a valid "32-bit string" ?
> >>
> >> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it
> >> cannot represent a unit of encoded text in a UTF-32 string.
> >
> > Even though the values with highest bit set are not a part of original
> > UTF-32, it can easily be extended also to original UTF-8, which may be
> > simpler to implement.
>
> "Original UTF-8," regardless of where defined, only ever encoded scalar
> values up to 0x7FFFFFFF. See, for example, RFC 2279.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
Received on Mon May 11 2015 - 14:27:03 CDT
