Re: Code point vs. scalar value from Philippe Verdy on 2013-09-17 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 18 Sep 2013 05:40:39 +0200

2013/9/18 Stephan Stiller <stephan.stiller_at_gmail.com>

> In what way does UTF-16 "use" surrogate code *points*? An encoding form
> is a mapping. Let's look at this mapping:
>
> - One *inputs* scalar values (not surrogate code points).
>
In fact the input is one code point.

The rest of the algorithm applies only if that code point has a scalar
value (the application may or may not test this).
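
To make the distinction concrete, here is a minimal sketch of that test
followed by "the rest of the algorithm". The function names are mine, not
the standard's, and raising an exception for a surrogate code point is only
one possible choice, since the standard does not prescribe any.

```python
from typing import List


def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (no surrogates)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def utf16_code_units(cp: int) -> List[int]:
    """Map one Unicode scalar value to its UTF-16 code unit sequence."""
    if not is_scalar_value(cp):
        # Assumption: this sketch rejects the input. UTF-16 itself says
        # nothing about what happens here.
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    if cp < 0x10000:
        # BMP scalar values are encoded as a single code unit.
        return [cp]
    # Supplementary scalar values are split into a surrogate pair.
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

Note that the surrogate values appear only on the output side, as code
units; on the input side a surrogate code point fails the test.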

When the code point has no scalar value, the standard does not specify what
the converter will do, or whether it will produce conforming UTF-16 on
output. Applications may still do anything they want in that case, provided
that the output conforms to the standard whenever the input is conforming.
The application can then claim conformance even if it uses these
unspecified extensions (conformance of the application is distinct from
conformance of the output: a non-standard extension in a conforming
application may or may not produce conforming UTF-16 output).

Even the simple fact of returning an error in the application can be
considered a distinct output, which is also NOT part of the UTF-16
standard (UTF-16 contains nothing for encoding the concept of encoding
errors). So conforming applications are free to: drop the offending
code point silently; generate some non-standard output; replace that
code point with another one that has a scalar value (the replacement
character is not specified in the UTF-16 standard); output some
data/event to an out-of-band channel separate from the UTF-16 output
stream; or stop the process (producing prematurely truncated output,
or continuing but changing the status returned along with the UTF-16
output).
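
As an illustration only, several of these options can be sketched as
policies of a single converter. Everything named here (OnError,
encode_utf16, the returned list of rejected indices) is hypothetical, and
the use of U+FFFD as the substitute is my assumption. The "non-standard
output" option is left out of the sketch.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class OnError(Enum):
    DROP = "drop"        # silently skip the offending code point
    REPLACE = "replace"  # substitute a code point that has a scalar value
    STOP = "stop"        # truncate the output at the first offender


def encode_utf16(code_points: Iterable[int],
                 on_error: OnError = OnError.REPLACE
                 ) -> Tuple[List[int], List[int]]:
    """Return (UTF-16 code units, indices of rejected inputs).

    The second element acts as an out-of-band status channel: it is
    never mixed into the UTF-16 output stream.
    """
    units: List[int] = []
    rejected: List[int] = []
    for i, cp in enumerate(code_points):
        if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
            rejected.append(i)
            if on_error is OnError.DROP:
                continue
            if on_error is OnError.STOP:
                break
            cp = 0xFFFD  # assumed replacement; not mandated by UTF-16
        if cp < 0x10000:
            units.append(cp)
        else:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))
            units.append(0xDC00 | (v & 0x3FF))
    return units, rejected
```

Whatever the policy, the code units produced are well-formed UTF-16, and
for input made only of scalar values all three policies give the same
result. They differ only in the case the standard leaves unspecified.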
Received on Tue Sep 17 2013 - 22:42:58 CDT
