RE: Code point vs. scalar value

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Thu, 19 Sep 2013 21:21:46 +0000

Stephan Stiller seems unconvinced by the various attempts to explain the situation. Perhaps an authoritative explanation of the textual history might assist.

Stephan demands an answer:

I want to know why the Glossary claims that surrogate code points are "[r]eserved for use by UTF-16".

Reason #1 (historical): Because the Glossary entry for “Surrogate Code Point” has been worded thusly since Unicode 4.0 (p. 1377), published in 2003, and hasn’t been reworded since.

Reason #2 (substantive): Because UTC members have been satisfied with the content of the statement and have not required it be changed in subsequent versions of the standard.

Reason #3 (intentional): Because the wording was added in the first place as part of the change to identify the term “surrogate character”, which had been widely used before, as a misnomer and a usage to be deprecated. The term “surrogate code point” was a deliberate introduction at that time to refer specifically to the range U+D800..U+DFFF of “code points” which could *not* be used to encode abstract characters.

Reason #4 (proximal): Because nobody recently has submitted a suggested improvement to the text of the relevant entry in the glossary (and associated text in Chapter 3) which has passed muster in the editorial committee and been considered to be an improvement on the text.

If it is exegesis rather than textual history that concerns you, here is what I consider to be a full explanation of the meaning of the text that troubles you so:

Code points in the range U+D800..U+DFFF are reserved for a special purpose, and cannot be used to encode abstract characters (thereby making them encoded characters) in the Unicode Standard. Note that it is perfectly valid to refer to these as code points and to use the U+ prefix for them. The U+ prefix identifies the Unicode codespace, and the glossary (correctly) identifies that as the range of integers from 0 to 10FFFF.

O.k., if the range of code points U+D800..U+DFFF is reserved for a special purpose, what is that purpose and how do we designate the range? The designation is easy: we call an element of the subrange U+D800..U+DBFF a “high-surrogate code point” (see D71) and an element of the subrange U+DC00..U+DFFF a “low-surrogate code point” (see D73), and by construction (and common usage), the elements of the union of those two subranges are called “surrogate code points”.

What is the special purpose? The shorthand description of the purpose is that the “surrogate code points” are “used for UTF-16”. But since that seems to confuse a minority of the readers of the standard, here is a longer explication: The surrogate code points are deliberately precluded from use to encode abstract characters to enable the construction of an efficient and unambiguous mapping between Unicode scalar values (the U+0000..U+D7FF and U+E000..U+10FFFF subranges of the Unicode codespace) and the sequences of 16-bit code units defined in the UTF-16 encoding form. In other words, the reservation *from* encoding for the code points U+D800..U+DFFF enables the use of the numerical range 0xD800..0xDFFF to define surrogate pairs mapping U+10000..U+10FFFF, while otherwise retaining a simple one-to-one mapping from code point to code unit in UTF-16 for the BMP code points which *are* used for encoding abstract characters. In short, the surrogate code points are “used for UTF-16”.
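That mapping can be sketched in a few lines. The following Python function (an illustration, not part of the standard's text; the function name is my own) shows how the reservation of 0xD800..0xDFFF lets UTF-16 map every scalar value above U+FFFF to a surrogate pair while keeping BMP scalar values one-to-one with code units:

```python
def encode_utf16(scalar: int) -> list[int]:
    """Map a Unicode scalar value to its UTF-16 code unit sequence."""
    assert 0 <= scalar <= 0x10FFFF and not (0xD800 <= scalar <= 0xDFFF), \
        "surrogate code points are not scalar values"
    if scalar <= 0xFFFF:
        return [scalar]            # BMP: code point maps one-to-one to code unit
    v = scalar - 0x10000           # 20 bits to distribute over two code units
    high = 0xD800 + (v >> 10)      # top 10 bits -> high-surrogate code unit
    low = 0xDC00 + (v & 0x3FF)     # low 10 bits -> low-surrogate code unit
    return [high, low]

print([hex(u) for u in encode_utf16(0x1F600)])   # ['0xd83d', '0xde00']
```

Because the ranges 0xD800..0xDBFF and 0xDC00..0xDFFF are never produced for BMP scalar values, a decoder can always tell a surrogate pair apart from two BMP code units, which is exactly why the mapping is unambiguous.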

Stephan’s next demand for an answer was:

Remind me real quick, in what way does a function "use" the input values that it's not defined on?

Well, the problem here is in the formulation of the implied question. I suspect, from the discussion in this thread, that Stephan has concluded that the generic wording “used for” in the glossary item in question necessarily implies that the surrogate code points are elements of the domain of the mapping function for UTF-16 (which maps Unicode scalar values to sequences of UTF-16 code units). Of course that inference is incorrect. Surrogate code points are excluded from that domain, by *definition*, as intended. And I have explained above what the phrase “used for” is actually used for in the glossary entry.
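One can observe that exclusion directly in an implementation. As a sketch (relying on CPython's codec behavior, not on the standard's text): Python will hold a lone surrogate code point in a str, but its UTF-16 encoder refuses it, because the encoder's domain is scalar values only:

```python
# A surrogate code point can exist as a code point in a Python str...
lone = chr(0xD800)

# ...but it is outside the domain of the UTF-16 encoding function.
try:
    lone.encode('utf-16')
except UnicodeEncodeError:
    print("U+D800 is not in the UTF-16 encoder's domain")

# A scalar value above U+FFFF, by contrast, encodes as a surrogate pair.
print("\U0001F600".encode('utf-16-be').hex())   # d83dde00
```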

Finally:

And what does this have to do with UTF-16?

It is definitional for UTF-16. I think that should also be clear from the explanation above.

Now, rather than quibbling further about what the glossary says, if the explanation still does not satisfy, and if the text in the glossary (and in Chapter 3) still seems wrong and misleading in some way, here is a more productive way forward:

Submit a proposal for a small textual change to the Unicode Technical Committee. This can either consist of an extended document (if long), or can be done on the online contact form (if short). (See the web site for submission details.) In a case like this, to be effective, such a proposal should have the following rhetorical structure, approximately:

===========================================================================================

1. I find (glossary entry/conformance clause/section/page/…) (confusing/misleading/erroneous…) for XYZ reasons.

2. The following reformulation of that text [insert exact text suggestion here] might be a useful improvement.

3. Please consider this suggestion at your next available opportunity.

Sincerely, etc., etc., with appropriate contact information

===========================================================================================

Anyone who wants to make an actual textual improvement to the standard can follow that general outline.

If, on the other hand, the goal here is simply to have a rousing argument for argument’s sake on the email list, then at a certain point others on the list may conclude that enough is enough. It might then be time to take the argument private, among those individual correspondents who wish to continue it.

--Ken



You haven't answered my questions. I want to know why the Glossary claims that surrogate code points are "[r]eserved for use by UTF-16". Remind me real quick, in what way does a function "use" the input values that it's not defined on? And what does this have to do with UTF-16?


Stephan
Received on Thu Sep 19 2013 - 18:04:17 CDT
