Re: What does it mean to "not be a valid string in Unicode"?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 7 Jan 2013 21:44:52 +0100

Well then I don't know why you need a definition of a "Unicode 16-bit
string". For me it means exactly the same as "16-bit string", and the
encoding in it is not relevant, given that you can put anything in it
without needing to be conformant to Unicode. So a Java string is
exactly the same thing, a 16-bit string. The same goes for Windows API
16-bit strings, or for "wide strings" in a C compiler where "wide" is
mapped by a compiler option to 16-bit code units for wchar_t (or
"short", but more safely as UINT16 if you don't want to be dependent
on compiler options or OS environments when compiling, when you need
to manage the exact memory allocation), or for a U-string in Perl.
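As a minimal sketch of that view, the following C fragment declares
such a "16-bit string" with a fixed-width code unit type, independent
of how a compiler maps wchar_t (the names unit16 and string16 are
illustrative only, not from any existing API):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint16_t unit16;    /* exactly 16 bits, regardless of wchar_t */

    typedef struct {
        size_t  length;   /* counted in 16-bit code units, not characters */
        unit16 *units;    /* may contain any values, paired or unpaired */
    } string16;

Nothing here constrains the contents; any sequence of 16-bit values is
representable, which is precisely why no Unicode conformance is implied.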

Only UTF-16 (not UTF-16BE and UTF-16LE, which are encoding schemes
with concrete byte orders and no leading BOM) is relevant to Unicode
here, because a 16-bit string does not itself specify any encoding
scheme or byte order.
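To illustrate the distinction, here is a hedged example of how the
single code unit 0x0041 ('A') serializes under each encoding scheme,
while the bare 16-bit code unit carries no byte order at all:

    #include <stdint.h>

    void serialize_example(void) {
        uint16_t unit = 0x0041;  /* one code unit; no byte order yet */

        /* UTF-16BE: high byte first -> 0x00 0x41 */
        uint8_t be[2] = { (uint8_t)(unit >> 8), (uint8_t)(unit & 0xFF) };

        /* UTF-16LE: low byte first -> 0x41 0x00 */
        uint8_t le[2] = { (uint8_t)(unit & 0xFF), (uint8_t)(unit >> 8) };

        (void)be; (void)le;  /* silence unused-variable warnings */
    }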

One confusion comes from the name "UTF-16" when it is also used as an
encoding scheme with an optional leading BOM and, when the BOM is
missing, an implied default byte order determined by guesses on the
first few characters: this encoding scheme (with support for a BOM and
an implicit guess of byte order if it's missing) should have been
given a distinct encoding name like "UTF-16XE". That would reserve
"UTF-16" for what the standard discusses as a "16-bit string", except
that it would still require UTF-16 conformance (no unpaired surrogates
and no non-characters) plus **no** BOM supported at this level. Such a
string is still not materialized by a concrete byte order or by an
implicit size in storage bits; it only needs to store distinctly the
whole range of code units 0x0000..0xFFFF minus the few non-characters,
and to enforce that all surrogates are paired, without enforcing that
any character is actually allocated.
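A sketch of a checker for exactly the constraints described above,
assuming the usual surrogate ranges 0xD800..0xDBFF and 0xDC00..0xDFFF
and the non-character set 0xFDD0..0xFDEF plus code points ending in
0xFFFE or 0xFFFF (the function names are illustrative; note that
rejecting non-characters goes beyond the standard's well-formedness
rule, which only forbids unpaired surrogates, but it matches the level
proposed here):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Non-characters: U+FDD0..U+FDEF, and any code point whose low
       16 bits are 0xFFFE or 0xFFFF. */
    static bool is_noncharacter(uint32_t cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    bool is_valid_utf16(const uint16_t *u, size_t n) {
        for (size_t i = 0; i < n; i++) {
            uint16_t c = u[i];
            if (c >= 0xD800 && c <= 0xDBFF) {        /* high surrogate */
                if (i + 1 >= n) return false;        /* unpaired at end */
                uint16_t d = u[++i];
                if (d < 0xDC00 || d > 0xDFFF) return false;  /* unpaired */
                uint32_t cp = 0x10000
                            + (((uint32_t)(c - 0xD800)) << 10)
                            + (d - 0xDC00);
                if (is_noncharacter(cp)) return false;
            } else if (c >= 0xDC00 && c <= 0xDFFF) { /* lone low surrogate */
                return false;
            } else if (is_noncharacter(c)) {
                return false;
            }
        }
        return true;
    }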

Note that such a relaxed version of UTF-16 would still allow an
internal alternate representation of 0x0000 for interoperating with
various APIs without changing the storage requirement: 0xFFFF could
perfectly well be used to replace 0x0000 where that code unit plays a
special role as a string terminator. But even when this is done, a
storage unit like 0xFFFF would still be perceived as if it were really
the code unit 0x0000.
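A two-line sketch of that escaping trick (illustrative names; it only
works because 0xFFFF is a non-character and therefore already excluded
from the stored repertoire at this level):

    #include <stdint.h>

    /* Store 0xFFFF in place of 0x0000 so the buffer can be handed to
       NUL-terminated APIs; map it back when reading code units. */
    static uint16_t store_unit(uint16_t u) { return u == 0x0000 ? 0xFFFF : u; }
    static uint16_t read_unit(uint16_t u)  { return u == 0xFFFF ? 0x0000 : u; }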

In other words, the concept of a completely relaxed "Unicode 16-bit
string" is unneeded, given that its single requirement is to define a
length in terms of 16-bit code units, with code units being large
enough to store any unsigned 16-bit value. Internally a code unit
could still be 18 bits wide on systems with 6-bit or 9-bit addressable
memory cells; the sizeof() property of this code unit type could still
be 2, or 3, or something else, as long as the type is large enough to
store the value. On some devices (not so exotic...) there are memory
areas that are 4-bit addressable or even 1-bit addressable (in that
latter case the sizeof() property for the code unit type would return
16, not 2). Some devices only have 16-bit or 32-bit addressable
memory, and sizeof() would return 1 (and the C types char and wchar_t
would most likely be the same).
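In standard C this guarantee is exactly what uint_least16_t expresses:
at least 16 value bits, with sizeof() left to the platform. A small
sketch:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint_least16_t unit16;  /* >= 16 value bits; width is
                                       platform-defined */

    int main(void) {
        /* Typically prints 2 on byte-addressable machines; on a
           word-addressable DSP where char is already 16 bits, it
           may print 1. */
        printf("sizeof(unit16) = %zu\n", sizeof(unit16));
        return 0;
    }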

2013/1/7 Doug Ewell <doug_at_ewellic.org>:
> You're right, and I stand corrected. I read Markus's post too quickly.
>
> Mark Davis ☕ <mark at macchiato dot com> wrote:
>
>>> But still non-conformant.
>>
>> That's incorrect.
>>
>> The point I was making above is that in order to say that something is "non-conformant", you have to be very clear what it is "non-conformant" TO.
>>
>>> Also, we commonly read code points from 16-bit Unicode strings, and
>>> unpaired surrogates are returned as themselves and treated as such
>>> (e.g., in collation).
>>
>> + That is conformant for Unicode 16-bit strings.
>>
>> + That is not conformant for UTF-16.
>>
>> There is an important difference.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell