Re: Code point vs. scalar value

From: Philippe Verdy <>
Date: Thu, 19 Sep 2013 20:03:29 +0200

2013/9/19 Asmus Freytag <>

> The legacy difference was the existence of UCS-2 in parallel with UTF-16.
Correct. But UCS-2 is still not extinct, even if it is no longer used for
exchanging interoperable plain text.

UCS-2 remains widely used for storing arbitrary data in "strings", without
any of the restrictions that must apply to UTF-16. Most UTF-16
libraries are in fact still more generic UCS-2 libraries that can be used
to process either pure UTF-16 or arbitrary UCS-2. These libraries are still
conforming processes if, when given any compliant UTF input data, they
always produce compliant UTF output data.
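
As a rough illustration of the point above, here is a minimal Python sketch
that models a string as a list of 16-bit code units. The names
is_well_formed_utf16 and ascii_upper_units are hypothetical and not taken
from any real library. The second function is a "generic UCS-2" operation
that never looks at surrogates, yet it yields well-formed UTF-16 whenever
its input is well-formed.

```python
def is_well_formed_utf16(units):
    """Return True if the 16-bit code units contain no unpaired surrogate."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True


def ascii_upper_units(units):
    """Generic 16-bit operation: uppercase ASCII letters and pass every
    other code unit through unchanged, surrogates included."""
    return [u - 0x20 if 0x61 <= u <= 0x7A else u for u in units]


# "a", U+1F600 (as the surrogate pair D83D DE00), "b"
well_formed = [0x61, 0xD83D, 0xDE00, 0x62]
# Arbitrary 16-bit data with a lone low surrogate
arbitrary = [0x61, 0xDE00, 0x62]
```

The same function accepts both inputs. Only the caller that needs
interoperable UTF-16 has to run the validity check, and only when it
chooses to.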

Applications may still use some API to determine the compliance of the
input data, but applications are not required to assert this compliance
every time. And the only place where the "scalar value property" matters is
during conversions between standard UTFs. Internally, when handling
text or even when enumerating each code point, it never matters
what the scalar value property is, if another binary value (e.g. a pointer
or reference address to an object containing the code point properties) can
be used to facilitate character handling or re-encoding between
various UTFs.
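
To make the conversion case concrete, here is a minimal Python sketch of
re-encoding 16-bit code units as UTF-8. The function names are hypothetical.
This is the one step where the scalar value has to be computed, and where an
unpaired surrogate has no valid answer. Rejecting it with an exception is
only one possible policy, chosen here as an assumption.

```python
def utf16_units_to_scalars(units):
    """Decode 16-bit code units into Unicode scalar values.
    Raises ValueError on an unpaired surrogate (an assumed policy)."""
    scalars = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Combine a surrogate pair into one supplementary scalar value.
            scalars.append(0x10000 + ((u - 0xD800) << 10)
                           + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError("unpaired surrogate at index %d" % i)
        else:
            scalars.append(u)
            i += 1
    return scalars


def scalar_to_utf8(s):
    """Encode one scalar value as UTF-8 bytes."""
    if s < 0x80:
        return bytes([s])
    if s < 0x800:
        return bytes([0xC0 | (s >> 6), 0x80 | (s & 0x3F)])
    if s < 0x10000:
        return bytes([0xE0 | (s >> 12),
                      0x80 | ((s >> 6) & 0x3F),
                      0x80 | (s & 0x3F)])
    return bytes([0xF0 | (s >> 18),
                  0x80 | ((s >> 12) & 0x3F),
                  0x80 | ((s >> 6) & 0x3F),
                  0x80 | (s & 0x3F)])


def utf16_units_to_utf8(units):
    """Convert between the two UTFs by going through scalar values."""
    return b"".join(scalar_to_utf8(s) for s in utf16_units_to_scalars(units))
```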

The binary value may also contain some additional state variables or
flags, such as the scalar value of the previous or next code point in the
text stream, an end-of-file indicator, a positional index, or the
current state of an output encoder/compressor (e.g. for SCSU). This extra
information is just like private fields in an object instance (in OO programming),
some dirty flags (for objects that need to be preserved if swapped out),
or parity/CRC bits, where the scalar value is just a public field or an
exposed computed property...
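
A minimal Python sketch of that idea, with a hypothetical CodePointHandle
class: the enumeration hands out objects carrying private state (position,
previous scalar value, end-of-text flag), and the scalar value is only one
exposed property among them. Python's str is used here as the underlying
storage purely for convenience.

```python
class CodePointHandle:
    """Hypothetical per-code-point value produced while enumerating text."""

    def __init__(self, scalar, index, previous, at_end):
        self._scalar = scalar      # private storage behind the public property
        self._index = index        # positional index in the text stream
        self._previous = previous  # scalar value of the previous code point
        self._at_end = at_end      # end-of-text indicator

    @property
    def scalar_value(self):
        """The only part that a UTF converter would need to see."""
        return self._scalar


def iter_handles(text):
    """Enumerate a Python str as a sequence of CodePointHandle objects."""
    previous = None
    last = len(text) - 1
    for index, ch in enumerate(text):
        yield CodePointHandle(ord(ch), index, previous, index == last)
        previous = ord(ch)


handles = list(iter_handles("a\u00e9"))
```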
Received on Thu Sep 19 2013 - 13:06:45 CDT
