Re: Code point vs. scalar value

From: Stephan Stiller <>
Date: Tue, 17 Sep 2013 14:55:16 -0700

> It is the wording in your posts that adds to the confusion.
My fundamental point is, has been, and continues to be that whenever
people use the more general word "code point" instead of the more
appropriate "scalar value", that will "add to the confusion". If you
make the presupposition
that your sequence of "code points" or "scalar values" contains no
surrogate values, then, yes, this will be
> [DE:] truly a distinction without a difference
but if you're using these words without an explicitly stated
presupposition, then one will assume that when you say "code point" you
do (surprise, surprise) actually mean "code point", which /according to
the official definitions/ will include "surrogate code points". I
mentioned this a while ago in a question about ICU, and KenW replied
that the real world contains bad data. I also think that this
> [DE:] it is very unlikely that Twitter and others are storing and interchanging loose surrogates
is incorrect. Not sure whether the Twitter hack I linked to made use of
/loose/ surrogates, but it was based on encoding and storing surrogates.
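To make the distinction concrete, here is a short Python sketch of my own (not something from the thread or from Twitter's code): U+D800 is, per the official definitions, a perfectly valid code point, but it is not a scalar value, and no well-formed UTF can encode it.

```python
# Scalar values = code points minus the surrogate range U+D800..U+DFFF.
def is_scalar_value(cp: int) -> bool:
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

assert not is_scalar_value(0xD800)  # a surrogate: a code point, not a scalar value
assert is_scalar_value(0x1F600)     # U+1F600: both a code point and a scalar value

# A Python str can hold a lone surrogate (Python strings are sequences of
# code points, not of scalar values), but well-formed UTF-8 cannot encode it:
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
except UnicodeEncodeError:
    print("U+D800 is a code point but cannot appear in well-formed UTF-8")
```

So a system that promises to handle "code points" without the no-surrogates presupposition is promising to handle exactly this kind of bad data.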

> [some paragraphs terminating in:]
> Some people writing end user materials may have shown terminological
> muddle

Sorry to say, but that's apparently how Twitter misconstrued it. The
alternative to characterizing how they've interpreted the word "code
point" (an interpretation that is rather un-crazy, though your email
minimizes the extent to which such interpretations or "mis"construals
exist online) is to say that Twitter has been, for a long time,
/blatantly/ wrong in their official attempt at clarifying the
distinguishing feature of their product, after having had the product
out for even longer.

From time to time I will encounter products that appear to handle
Unicode but whose string handling gets deeply confused once you
enter/paste anything beyond the BMP; you can blame this on confusing
"code point" with "code unit" instead, but if the first word didn't
exist (because it shouldn't), there would be no confusion.
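For illustration (my own example, not taken from any particular product): the breakage described above typically comes from counting UTF-16 code units where code points were intended, which goes unnoticed until a character beyond the BMP shows up.

```python
s = "a\U0001F600"  # 'a' plus U+1F600, a character beyond the BMP

code_points = len(s)                           # Python 3 counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes

print(code_points)  # 2
print(utf16_units)  # 3 -- U+1F600 needs a surrogate pair in UTF-16
```

A product that reports 3 "characters" here is confusing "code unit" with "code point"; for BMP-only input the two counts coincide, which is why the bug hides.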

This qualification
> [AF:] by those who have the requisite technical background
of this statement
> [AF:] to insinuate that the definitions are widely confused
of course makes it true. As long as "high-surrogate code point" and
"low-surrogate code point" aren't officially deprecated, confusion will
persist. They should be deprecated, because, /as you say/:
> [AF:] Once you add the UTF-prefix, you are, by force, speaking of code
> units.
So the high-low distinction for "surrogate" code points is misleading,
and the "surrogate" attribute for "code point" shouldn't be there,
because, as I've in fact written in a much earlier thread and as people
know, surrogates are UTF-16-specific.
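As a concrete footnote (my own sketch, not anyone's quoted code): the high/low machinery exists solely so that UTF-16 can reach beyond the BMP, which is why attaching "high-surrogate"/"low-surrogate" to code points, rather than to UTF-16 code units, misleads.

```python
def decode_surrogate_pair(hi: int, lo: int) -> int:
    # UTF-16 decoding rule: a high surrogate (U+D800..U+DBFF) followed by
    # a low surrogate (U+DC00..U+DFFF) encodes one scalar value >= U+10000.
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U+1F600 appears in UTF-16 as the code-unit pair D83D DE00:
assert decode_surrogate_pair(0xD83D, 0xDE00) == 0x1F600
```

The arithmetic only makes sense over 16-bit code units; in UTF-8 or UTF-32 there is nothing for "high" and "low" to refer to.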

Received on Tue Sep 17 2013 - 16:58:40 CDT

This archive was generated by hypermail 2.2.0 : Tue Sep 17 2013 - 16:58:42 CDT