Re: Corrigendum #9 from Richard Wordingham on 2014-06-12 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 12 Jun 2014 19:28:45 +0100

On Thu, 12 Jun 2014 01:37:49 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson
> <public_at_khwilliamson.com> wrote:

> > The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not
> > realize that that was considered representable in any UTF.
> > Likewise -1.

> No, and that's the point of using those. Integer values that are not
> code points make for great sentinels in API functions, such as a
> next() iterator returning -1 when there is no next character.

They work fine as alternatives to scalar values. They don't work so
well in 8-bit and 16-bit Unicode strings. A general purpose routine
extracting scalar values from Unicode strings is likely to treat them
as errors rather than just returning the scalar value as it would for
a non-character. The only way to use them directly in 8- and
16-bit Unicode strings is to deliberately create ill-formed Unicode
strings.
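
A rough Python sketch of the distinction above, not taken from any particular library: the names next_scalar, utf8_decodes and SENTINEL are made up for illustration. It assumes a strict UTF-8 decoder (here Python's built-in one) stands in for the "general purpose routine". The -1 works as a return value, the noncharacter U+FFFF passes through the decoder, and the only byte pattern that could carry 0x7FFFFFFF is rejected.

```python
SENTINEL = -1  # hypothetical out-of-band value; not a code point


def next_scalar(text, index):
    """Return (scalar value, next index), or (SENTINEL, index) at the end."""
    if index >= len(text):
        return SENTINEL, index
    return ord(text[index]), index + 1


def utf8_decodes(data):
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The noncharacter U+FFFF has a well-formed UTF-8 encoding.
nonchar_bytes = "\uffff".encode("utf-8")

# 0x7FFFFFFF only fits the obsolete six-byte pattern, which is ill-formed
# under the current definition of UTF-8.
sentinel_bytes = bytes([0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])

print(next_scalar("a", 1))           # sentinel as a return value
print(utf8_decodes(nonchar_bytes))   # noncharacter accepted
print(utf8_decodes(sentinel_bytes))  # sentinel bytes rejected
```

So the sentinel lives happily in the function's return type, but has no well-formed home inside the string itself.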

Thus, these 'sentinels' are not full-blown sentinels like U+0000 in the
C conventions for 'strings', as opposed to arrays of char.
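
For contrast, a small sketch of the C-style in-band sentinel, written in Python only for illustration. The function name c_string_length is hypothetical, and it assumes the buffer really does contain a terminating zero byte. The point is that U+0000 is an ordinary character with a well-formed encoding, so the terminator can sit inside the stored string.

```python
def c_string_length(buffer):
    """Count bytes before the first 0x00, in the manner of C's strlen.

    Assumes the buffer is terminated; an unterminated buffer raises IndexError.
    """
    length = 0
    while buffer[length] != 0:
        length += 1
    return length


# The terminator U+0000 is stored in the string and is still well-formed UTF-8.
stored = "abc".encode("utf-8") + b"\x00"

print(c_string_length(stored))
print(stored.decode("utf-8") == "abc\x00")
```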

There is a get-out clause: just never accept that a Unicode string
purports to be in a Unicode character encoding form, since only then
is it required to be well-formed.

Richard.

Received on Thu Jun 12 2014 - 13:30:13 CDT
