Re: Code point vs. scalar value

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 19 Sep 2013 00:14:21 +0200

2013/9/18 Stephan Stiller <stephan.stiller_at_gmail.com>

> On 9/18/2013 2:42 AM, Philippe Verdy wrote:
>
> There are "scalar values" used in so many other unrelated domains [...]
>
> There is no risk for confusion with vectors or complex numbers or reals or
> whatnot.
>

Yes, there are such risks. I gave a meaningful example with formatting and
parsing libraries. Historically, the term "scalar value" (often shortened to
"scalar") has been, and still is, widely used in mathematics. And today
mathematics is widely done on computers, with computations performed
symbolically, numerically, with estimation heuristics, or through massive
simulations.

These mathematical operations need input (so they use parsers or data
sensors) and output (so they use formatters, not limited to plain text: this
could be rich text or colorful graphics as well). But most of this input and
output data will be textual, and will therefore also need to be encoded. The
more universal mathematical concept of scalar values will then collide
heavily with the specific internal definition of "scalar values" used only to
delimit a small definition domain for standard UTF conversions.

That's why I would propose exactly the opposite of what you want: avoid using
"scalar value" alone, and speak only of the "Unicode scalar value" character
property. But it could just as well be removed completely from the
definitions (including deprecating it entirely as a "character property"). It
would mean that all code points have an associated integer value within an
unrestricted range (just large enough to distinguish all values between
U+0000 and U+10FFFF). The restriction of ranges would ONLY apply in the
internal description of standard UTFs.
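
To make the distinction concrete, here is a minimal Python sketch (my own
illustration, not wording from the standard): every integer in the codespace
is a code point, and the scalar values are simply the code points minus the
surrogate range (definition D76 in the core standard).

    def is_code_point(n):
        """Any integer in the Unicode codespace U+0000..U+10FFFF."""
        return 0 <= n <= 0x10FFFF

    def is_scalar_value(n):
        """A code point outside the surrogate range U+D800..U+DFFF."""
        return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)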

But I think that the definition of the "scalar value" character property was
only made to save common text that would otherwise need to be replicated in
the description of each standardized UTF, namely UTF-8, UTF-16 (BE/LE),
UTF-32 (BE/LE), CESU-8, BOCU-1, SCSU... or even the (deprecated?) UTF-7. It
should now also be used with other "legacy" standards such as GB18030 (which
should be compatible, at least for now in its latest version, with the
standard UTFs published by Unicode and ISO/IEC/IETF).

IMHO, this definition should simply be moved into the chapter that presents
these standard UTFs (or other non-standard UTFs that respect some minimum
condition, namely being able to represent any code point reversibly,
*including* unpaired surrogates, even if documents containing surrogate code
points would not be conforming, and with no guarantee that those surrogates
can be distinguished as paired or unpaired).
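
As a small illustration of that minimum condition (my example, using Python's
standard library rather than anything from the Unicode text): the
'surrogatepass' error handler round-trips a lone surrogate through a
UTF-8-style byte sequence, while strict UTF-8 must reject it.

    lone = "\ud800"  # an unpaired high surrogate: a code point, not a scalar value

    # Lenient round-trip: U+D800 becomes b'\xed\xa0\x80' and decodes back intact.
    encoded = lone.encode("utf-8", errors="surrogatepass")
    assert encoded.decode("utf-8", errors="surrogatepass") == lone

    # Strict UTF-8 refuses the same code point, as the standard UTFs require.
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError:
        pass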

In my opinion, the term "scalar value" used alone is definitely confusing
(even if we add "Unicode", because the Unicode Consortium also hosts the CLDR
project, frequently speaks about interoperability with mathematics, and
includes many direct references to mathematics in its core standard), and I
would simply prefer "interoperable code point" (with a basic statement in the
introduction of the description of each UTF fixing the interoperability
conditions for all standard UTFs or for any other conforming UTFs).