Re: Corrigendum #9 clarifies noncharacter usage in Unicode

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Fri, 22 Feb 2013 23:01:49 +0000

On Thu, 21 Feb 2013 15:26:09 -0800
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Thu, Feb 21, 2013 at 2:12 PM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:

> Nothing requires a library that processes 16-bit Unicode strings to
> have a 16-bit type for a single-character return value. Just like the
> C standard getc() returns a *negative* EOF value, in an integer type
> that is wider than a byte.

0xFFFF for WEOF looks like a hang-over from 16-bit int; changing from
it does not seem easy. Fortunately, one can successfully read past
U+FFFF in a file, unlike ctrl/Z in a DOS text file.
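
To make the getc() comparison concrete, here is a minimal sketch in C of
a reader over 16-bit code units whose return type is wider than the code
unit, so that the end marker cannot collide with 0xFFFF. The names
u16_reader, u16_next and U16_END are hypothetical, invented for this
illustration; they are not from any real library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-input marker: negative, so outside 0..0xFFFF. */
#define U16_END (-1)

/* Hypothetical cursor over a buffer of 16-bit code units. */
typedef struct {
    const uint16_t *units;
    size_t len;
    size_t pos;
} u16_reader;

/* Return the next code unit as a non-negative int32_t, or U16_END when
   the buffer is exhausted. Because the return type is wider than the
   code unit, 0xFFFF is returned as ordinary data, in the same way that
   getc() returns byte 0xFF as 255 rather than as EOF. */
int32_t u16_next(u16_reader *r)
{
    if (r->pos >= r->len) {
        return U16_END;
    }
    return (int32_t)r->units[r->pos++];
}
```

With a 16-bit return type and 0xFFFF as the end marker, a caller could
not tell the code unit 0xFFFF in the data from the end of input by the
return value alone.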

> > The UTC is now applying additional pressure for the making of the
> > distinction between UTF-16 and UTF-16LE.
 
> The UTC is doing no such thing. Nothing has changed with regard to the
> UTF-16 encoding scheme and the BOM.

I didn't say the application of pressure was deliberate.

> U+FFFE has always been a code point that will never have a real
> character assigned to it, that's why it is *unlikely* to appear as
> the first character in a text file and thus useful as a "reverse
> BOM". However, it was never forbidden from occurring in the text.

Its support was not encouraged, and it was forbidden in interchanged
text. This particular noncharacter is still forbidden in XML Version
1.0.
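
As an illustration of the XML point, the Char production of XML 1.0 can
be sketched as a C predicate. The function name is_xml10_char is made up
for this example. Note that the production excludes U+FFFE and U+FFFF
but not the other noncharacters.

```c
#include <stdbool.h>
#include <stdint.h>

/* XML 1.0 Char production:
     #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
   is_xml10_char is a hypothetical name used only for this sketch. */
bool is_xml10_char(uint32_t cp)
{
    if (cp == 0x9 || cp == 0xA || cp == 0xD) {
        return true;
    }
    if (cp >= 0x20 && cp <= 0xD7FF) {
        return true;
    }
    /* The upper bound stops at U+FFFD, so U+FFFE and U+FFFF are excluded. */
    if (cp >= 0xE000 && cp <= 0xFFFD) {
        return true;
    }
    return cp >= 0x10000 && cp <= 0x10FFFF;
}
```

Under this predicate U+FDD0 and U+1FFFE are accepted, even though they
are noncharacters, while U+FFFE and U+FFFF are rejected.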

TUS 1.0.0 Section 2.4 forbade U+FFFE and U+FFFF. TUS 2.0.0 Section 2.3
is less strict:

"Two codes are not used to encode characters: U+FFFF is reserved for
internal use (as a sentinel) and should not be transmitted or stored
as part of plain text. U+FFFE is also reserved. Its presence may
indicate byte-swapped Unicode data."

That paragraph legitimised the use of 0xFFFF for WEOF. Note that wint_t
and wchar_t are explicitly allowed to be the same type; what is
required is that no character be encoded by WEOF.

> Best practice for file encodings has always been to declare the
> encoding.

In general it can't be declared in the plainest of plain text, except
possibly as a file attribute separate to the file content.

> Second best for UTF-16 is to always include the BOM, even if the byte
> order is big-endian. And since most computers are little-endian, they
> need to include the BOM in UTF-16 file encodings anyway (if they use
> their native endianness).

A higher-order protocol seems to work fine. At least, it did with
Notepad on Windows XP: Windows 7 seems to be applying some
content-based checking.
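
For comparison, the BOM convention under discussion can be sketched in C
as below. This is only an illustration under stated assumptions: the
names utf16_bom and detect_utf16_bom are hypothetical, and the sketch
looks at the first two bytes only, with no content-based checking of the
kind Windows 7 appears to do.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE } utf16_bom;

/* Inspect the first two bytes of a buffer for U+FEFF serialised in
   either byte order. FE FF is the big-endian form; FF FE is what a
   big-endian reader would see as U+FFFE, i.e. byte-swapped data.
   Caveat: FF FE 00 00 is also the UTF-32LE signature, which this
   sketch does not try to distinguish. */
utf16_bom detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            return BOM_UTF16_BE;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            return BOM_UTF16_LE;
        }
    }
    /* No signature: the byte order has to come from a declaration or
       some other higher-order protocol. */
    return BOM_NONE;
}
```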

Richard.
Received on Fri Feb 22 2013 - 17:03:25 CST
