Re: Code point vs. scalar value from Mark Davis ☕ on 2013-09-20 (Unicode Mail List Archive)

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Fri, 20 Sep 2013 12:33:24 +0200

Nicely stated.

Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Thu, Sep 19, 2013 at 11:21 PM, Whistler, Ken <ken.whistler_at_sap.com> wrote:

> Stephan Stiller seems unconvinced by the various attempts to explain the
> situation. Perhaps an authoritative explanation of the textual history
> might assist.
>
> Stephan demands an answer:
>
> I want to know why the Glossary claims that surrogate code points are
> "[r]eserved for use by UTF-16".
>
> Reason #1 (historical): Because the Glossary entry for “Surrogate Code
> Point” has been worded thusly since Unicode 4.0 (p. 1377), published in
> 2003, and hasn’t been reworded since.
>
> Reason #2 (substantive): Because UTC members have been satisfied with the
> content of the statement and have not required it be changed in subsequent
> versions of the standard.
>
> Reason #3 (intentional): Because the wording was added in the first place
> as part of the change to identify the term “surrogate character”, which had
> been widely used before, as a misnomer and a usage to be deprecated. The
> term “surrogate code point” was a deliberate introduction at that time to
> refer specifically to the range U+D800..U+DFFF of “code points” which could
> *not* be used to encode abstract characters.
>
> Reason #4 (proximal): Because nobody recently has submitted a suggested
> improvement to the text of the relevant entry in the glossary (and
> associated text in Chapter 3) which has passed muster in the editorial
> committee and been considered to be an improvement on the text.
>
> If it is exegesis rather than textual history that concerns you, here is
> what I consider to be a full explanation of the meaning of the text that
> troubles you so:
>
> Code points in the range U+D800..U+DFFF are reserved for a special
> purpose, and cannot be used to encode abstract characters (thereby making
> them encoded characters) in the Unicode Standard. Note that it is perfectly
> valid to refer to these as code points and use the U+ prefix for them. The
> U+ prefix identifies the Unicode codespace, and the glossary (correctly)
> identifies that as the range of integers from 0 to 10FFFF. O.k., if the
> range of code points U+D800..U+DFFF is reserved for a special purpose,
> what is that purpose and how do we designate the range? The designation is
> easy: we call elements of the subrange U+D800..U+DBFF “high-surrogate code
> points” (see D71) and the elements of the subrange U+DC00..U+DFFF
> “low-surrogate code points” (see D73), and by construction (and common
> usage), the elements contained in the union of those two subranges are
> called “surrogate code points”. What is the special purpose? The shorthand
> description of the purpose is that the “surrogate code points” are “used
> for UTF-16”. But since that seems to confuse a minority of the readers of
> the standard, here is a longer explication: The surrogate code points are
> deliberately precluded from use to encode abstract characters to enable the
> construction of an efficient and unambiguous mapping between Unicode scalar
> values (the U+0000..U+D7FF, U+E000..U+10FFFF subranges of the Unicode
> codespace) and the sequences of 16-bit code units defined in the UTF-16
> encoding form. In other words, the reservation *from* encoding for the code
> points U+D800..U+DFFF enables the use of the numerical range 0xD800..0xDFFF
> to define surrogate pairs to map U+10000..U+10FFFF, while otherwise
> retaining a simple one-to-one mapping from code point to code unit in
> UTF-16 for the BMP code points which *are* used for encoding abstract
> characters. In short, the surrogate code points are “used for UTF-16”.
>
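
The mapping described in the paragraph above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not text
from the standard: the names is_scalar_value, utf16_encode and utf16_decode
are hypothetical helpers, and code units are modelled as plain integers.

```python
from typing import List

SURROGATE_MIN, SURROGATE_MAX = 0xD800, 0xDFFF  # surrogate code points


def is_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (no surrogates)."""
    in_codespace = 0 <= code_point <= 0x10FFFF
    is_surrogate = SURROGATE_MIN <= code_point <= SURROGATE_MAX
    return in_codespace and not is_surrogate


def utf16_encode(scalar: int) -> List[int]:
    """Map one Unicode scalar value to one or two 16-bit code units."""
    if not is_scalar_value(scalar):
        raise ValueError(f"U+{scalar:04X} is not a Unicode scalar value")
    if scalar < 0x10000:
        # BMP: simple one-to-one mapping from code point to code unit.
        return [scalar]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = scalar - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def utf16_decode(units: List[int]) -> List[int]:
    """Map a sequence of 16-bit code units back to scalar values."""
    scalars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high-surrogate code unit must be followed by a low one.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high-surrogate code unit")
            low = units[i + 1]
            scalars.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low-surrogate code unit")
        else:
            scalars.append(unit)
            i += 1
    return scalars
```

Because no scalar value falls in 0xD800..0xDFFF, a code unit in that range
can only ever be half of a surrogate pair, which is what keeps the decoding
unambiguous. For example, utf16_encode(0x1F600) yields [0xD83D, 0xDE00].
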
> Stephan’s next demand for an answer was:
>
> Remind me real quick, in what way does a function "use" the input values
> that it's not defined on?
>
> Well, the problem here is in the formulation of the implied question. I
> suspect, from the discussion in this thread, that Stephan has concluded
> that the generic wording “used for” in the glossary item in question
> necessarily imputes that the surrogate code points are therefore elements
> of the domain of the mapping function for UTF-16 (which maps Unicode scalar
> values to sequences of UTF-16 code units). Of course that imputation is
> incorrect. Surrogate code points are excluded from that domain, by
> *definition*, as intended. And I have explained above what the phrase “used
> for” is actually used for in the glossary entry.
>
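
The point about the domain can be checked against an existing UTF-16
encoder. The sketch below assumes CPython 3, whose built-in "utf-16-le"
codec rejects surrogate code points in its default strict mode; the helper
name in_utf16_domain is hypothetical and chosen only for this illustration.

```python
def in_utf16_domain(code_point: int) -> bool:
    """Return True if the strict UTF-16 encoder accepts this code point."""
    try:
        # chr() accepts any code point in 0..0x10FFFF, surrogates included,
        # so the str can hold a value the encoder is not defined on.
        chr(code_point).encode("utf-16-le")
    except UnicodeEncodeError:
        return False
    return True


# Collect every code point in the codespace that the encoder refuses.
excluded = [cp for cp in range(0x110000) if not in_utf16_domain(cp)]

print(f"U+{excluded[0]:04X}..U+{excluded[-1]:04X}, {len(excluded)} code points")
```

The excluded values are exactly the surrogate code points: they are valid
code points (chr() accepts them) but lie outside the domain of the encoding
function.
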
> Finally:
>
> And what does this have to do with UTF-16?
>
> It is definitional for UTF-16. I think that should also be clear from the
> explanation above.
>
> Now, rather than quibbling further about what the glossary says, if the
> explanation still does not satisfy, and if the text in the glossary (and in
> Chapter 3) still seems wrong and misleading in some way, here is a more
> productive way forward:
>
> Submit a proposal for a small textual change to the Unicode Technical
> Committee. This can either consist of an extended document (if long), or
> can be done on the online contact form (if short). (See the web site for
> submission details.) In a case like this, to be effective, such a proposal
> should have the following rhetorical structure, approximately:
>
> ===========================================================================================
>
> 1. I find (glossary entry/conformance clause/section/page/…)
> (confusing/misleading/erroneous…) for XYZ reasons.
>
> 2. The following reformulation of that text [insert exact text suggestion
> here] might be a useful improvement.
>
> 3. Please consider this suggestion at your next available opportunity.
>
> Sincerely, etc., etc., with appropriate contact information
>
> ===========================================================================================
>
> Anyone who wants to make an actual textual improvement to the standard can
> follow that general outline.
>
> If, on the other hand, the goal here is simply to have a rousing argument
> for argument’s sake on the email list, at a certain point, others on the
> list may conclude that enough is enough. It might be time then to take the
> argument private to those individual correspondents who wish to continue
> the argument.
>
> --Ken
>
> You haven't answered my questions. I want to know why the Glossary claims
> that surrogate code points are "[r]eserved for use by UTF-16". Remind me
> real quick, in what way does a function "use" the input values that it's
> not defined on? And what does this have to do with UTF-16?
>
> Stephan
>
Received on Fri Sep 20 2013 - 05:35:42 CDT
