Re: Code point vs. scalar value

From: Hans Aberg <haberg-1_at_telia.com>
Date: Thu, 19 Sep 2013 15:32:54 +0200

On 18 Sep 2013, at 04:57, Stephan Stiller <stephan.stiller_at_gmail.com> wrote:

> In what way does UTF-16 "use" surrogate code points? An encoding form is a mapping. Let's look at this mapping:
> • One inputs scalar values (not surrogate code points).
> • The encoding form will output a short sequence of encoding form–specific code units. (Various voices on this list have stated that these should never be called code points.)
> • The algorithm mapping from input to output doesn't make use of surrogate code points. (Even though the Glossary states, under "Surrogate Code Point", that they are "for use by UTF-16".) The only "use" is indirect, through awareness of the positioning and size of the range of non-scalar-value code points.

What you observe is in fact a flaw in the construction of UTF-16. As you mention, the correct way is to define character numbers, plus a way to translate them into a binary format. This is how the original UTF-8, the UNIX version, worked. The current construction is legacy, so there is not much to do about it. Use UTF-8 or UTF-32 if you can.
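
To make the mapping concrete, here is a minimal Python sketch of UTF-16 as a function from one scalar value to 16-bit code units. The name utf16_encode_scalar is made up for illustration and is not taken from any library. Note that the surrogate range D800..DFFF shows up only as an excluded input range and as the numeric range the output code units fall into, never as an input.

```python
def utf16_encode_scalar(sv: int) -> list[int]:
    """Map one Unicode scalar value to a list of 16-bit UTF-16 code units.

    Hypothetical helper for illustration only.
    """
    # Scalar values are the code points 0..10FFFF minus the surrogate
    # range D800..DFFF; anything else is rejected as input.
    if not (0 <= sv <= 0x10FFFF) or 0xD800 <= sv <= 0xDFFF:
        raise ValueError(f"not a scalar value: {sv:#x}")

    # Values below 10000 are represented by a single code unit.
    if sv < 0x10000:
        return [sv]

    # Supplementary values: split the 20-bit offset into two 10-bit
    # halves and place them in the high and low surrogate unit ranges.
    offset = sv - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]
```

For example, utf16_encode_scalar(0x1F600) gives the two code units D83D and DE00, while passing 0xD800 raises ValueError because it is not a scalar value.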

Hans
Received on Thu Sep 19 2013 - 08:35:58 CDT
