Re: Corrigendum #9 clarifies noncharacter usage in Unicode from Markus Scherer on 2013-02-21 (Unicode Mail List Archive)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 21 Feb 2013 11:52:07 -0800

On Thu, Feb 21, 2013 at 11:06 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> On Wed, 20 Feb 2013 12:49:39 -0800
> announcements_at_unicode.org wrote:
>
> > They should be supported by APIs, components, and
> > applications that handle (i.e., either process or pass through) all
> > Unicode strings, such as a text editor or string class. Where an
> > application does make internal use of a noncharacter, it should take
> > some measures to sanitize input text from unknown sources.
>
> Does this mean that a general purpose application written in C that uses
> Microsoft's 16-bit wchar_t to handle little-endian UTF-16 input using
> the fgetwc() function should be regarded as broken? The problem is
> that a return value of 0xFFFF means not non-character U+FFFF, but end
> of file!
>

"fgetwc returns, as a wint_t <http://msdn.microsoft.com/en-us/library/323b6b3k.aspx>,
the wide character that corresponds to the character read or returns WEOF to
indicate an error or end of file. For both functions, use feof or ferror to
distinguish between an error and an end-of-file condition."
http://msdn.microsoft.com/en-us/library/c7sskzc1.aspx

In other words, the wint_t value WEOF is supposed to be out-of-range for
normal characters, and if in doubt, the API docs tell you to call feof().
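
A minimal sketch of that reading pattern, assuming a stream that is already open for wide-character input. The helper name count_wide_chars is hypothetical and used only for illustration; fgetwc, WEOF, feof and ferror are the standard names quoted above.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: counts the wide characters read from fp.
   Returns the count at end of file, or -1 if the stream reports an error. */
long count_wide_chars(FILE *fp)
{
    long n = 0;
    wint_t c; /* wint_t is the type specified to hold WEOF as well as characters */

    while ((c = fgetwc(fp)) != WEOF) {
        n++;
    }

    /* WEOF alone does not say why reading stopped; ask the stream. */
    if (ferror(fp)) {
        return -1;
    }
    return n;
}
```

The result of fgetwc() is kept in a wint_t rather than a wchar_t, and ferror() is consulted after the loop instead of inferring the reason from the returned value.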

On my Ubuntu laptop, wchar.h defines WEOF=0xffffffffu which is thoroughly
out of range for Unicode.

The comment for *wint_t* says
/* Integral type unchanged by default argument promotions that can
   hold any value corresponding to members of the extended character
   set, as well as *at least one value that does not correspond to any
   member of the extended character set*. */

I don't have a Windows system handy to check for the value there. I assume
that it follows the standard:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/wchar.h.html says:
*wint_t*: An integral type capable of storing any valid value of *wchar_t*,
or *WEOF*.

*WEOF*: Constant expression of type *wint_t* that is returned by several WP
functions to indicate end-of-file.

Similarly, the C standard library defines EOF as a negative int (typically
*-1*), precisely so that it cannot be mistaken for a real content byte.

A negative sentinel value has the benefit that you need not check for
equality but can just test "<0" which makes for shorter source code and
also slightly smaller and faster machine code.
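
As a small illustration of the negative-sentinel test, here is a sketch of a byte-counting loop. The helper name count_bytes is hypothetical; the only assumption beyond the text above is that getc() returns each byte as an unsigned char converted to int, so a 0xFF byte arrives as 255 and never as a negative value.

```c
#include <stdio.h>

/* Hypothetical helper: counts bytes on fp until getc() returns a negative value. */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c; /* int, not char: must hold 0..UCHAR_MAX plus the negative EOF */

    while ((c = getc(fp)) >= 0) { /* sign test instead of comparing with EOF */
        n++;
    }
    return n;
}
```

Storing the result in a char instead of an int would defeat the scheme, because a 0xFF byte could then become indistinguishable from the sentinel.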

If you use an in-range value for end-of-input or something like that, then
you get into trouble. That is trivially the case, and has nothing to do
with Unicode.

> U+FFFE at the start of a UTF-16 file should also cause some headaches!
> Doesn't Microsoft Windows still interpret this as a byte-order mark
> without asking whether there may be a byte-order mark?
>

In the UTF-16 *encoding scheme*, such as in an otherwise unmarked file, the
leading bytes FF FE and FE FF have special meaning. Again, this has nothing
to do with the first character in a string in code. None of this has
changed.

markus
Received on Thu Feb 21 2013 - 13:54:16 CST