Re: Code point vs. scalar value

From: Asmus Freytag
Date: Thu, 19 Sep 2013 08:37:35 -0700

On 9/19/2013 6:32 AM, Hans Aberg wrote:
> On 18 Sep 2013, at 04:57, Stephan Stiller wrote:
>> In what way does UTF-16 "use" surrogate code points? An encoding form is a mapping. Let's look at this mapping:
>> • One inputs scalar values (not surrogate code points).
>> • The encoding form will output a short sequence of encoding form–specific code units. (Various voices on this list have stated that these should never be called code points.)
>> • The algorithm mapping from input to output doesn't make use of surrogate code points. (Even though the Glossary states, under "Surrogate Code Point", that they are "for use by UTF-16".) The only "use" is indirect, through awareness of the positioning and size of the range of code points that are not scalar values.
> What you observe is in fact a mistake in the construction of UTF-16. As you mention, the correct way is to define character numbers, plus a way to translate them into a binary format. This is how the original UTF-8 (the UNIX version) worked. The current construction is legacy, so there is not much to do about it. Use UTF-8 or UTF-32 if you can.
The legacy difference was the existence of UCS-2 in parallel with UTF-16.
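
For readers who want the mapping from the quoted list spelled out, here is a minimal sketch in Python. The function name utf16_code_units is hypothetical and not taken from this thread or from the standard's text. It accepts a scalar value and returns UTF-16 code units; the surrogate range shows up only as a rejected input range and as the numeric range of the output code units.

```python
from __future__ import annotations


def utf16_code_units(scalar: int) -> list[int]:
    """Map one Unicode scalar value to its UTF-16 code unit sequence.

    Hypothetical helper for illustration only.
    """
    # Scalar values are all code points except the surrogate range D800..DFFF.
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"0x{scalar:X} is not a Unicode scalar value")

    # BMP scalar values map to a single 16-bit code unit with the same number.
    if scalar < 0x10000:
        return [scalar]

    # Supplementary scalar values: split the 20-bit offset into two 10-bit
    # halves and place them in the lead (D800..DBFF) and trail (DC00..DFFF)
    # code unit ranges.
    offset = scalar - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

Calling utf16_code_units(0x1F600) gives the two code units 0xD83D and 0xDE00, while utf16_code_units(0xD800) raises ValueError, since a surrogate code point is not a valid input to the encoding form.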

Received on Thu Sep 19 2013 - 10:40:00 CDT
