Re: Origin of Ellipsis (was: RE: Empty set)

From: Stephan Stiller <>
Date: Sun, 15 Sep 2013 21:35:33 -0700

Doug wrote me:
> You're not confusing "code point" with "code unit," are you?
Thanks for the note.

I think what you're saying is that I thought (or meant to write) "by
first representing the sequence of scalar values in an encoding form and
then counting [code points typecast from] code _units_". I think you are
right, but there are some points of confusion; see below. Somehow I
thought of "surrogate pair" as a "pair of (surrogate) code points"
instead of a "pair of (surrogate) code units". I guess that additional
level of indirection would make my interpretation (b) unlikely ... I
think my statement is still technically correct, because for UTF-16,
counting code points (typecast from code units) and counting code units
yield the same count.
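To check that claim concretely, here's a small Python 3 sketch (my own
illustration, not anything from the standard): encode one supplementary
character in UTF-16 and compare the counts.

```python
# One supplementary character, U+1F600 (outside the BMP).
s = "\U0001F600"

# One scalar value / one code point (Python 3 strings are sequences
# of code points, so len() counts them).
assert len(s) == 1

# The UTF-16 encoding form represents it as two 16-bit code units.
data = s.encode("utf-16-be")          # big-endian, no BOM
code_units = [int.from_bytes(data[i:i + 2], "big")
              for i in range(0, len(data), 2)]
print([hex(u) for u in code_units])   # ['0xd83d', '0xde00']

# Typecast each code unit to a code point: both land in the surrogate
# range U+D800..U+DFFF, so counting these "surrogate code points"
# gives the same number as counting code units.
assert all(0xD800 <= u <= 0xDFFF for u in code_units)
assert len(code_units) == 2
```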

What's confusing is a term like "high-surrogate code point" (see
glossary). If surrogate code points are never encoded, then they
practically don't exist in the ontology of Unicode terms, aside from
being holes in the scalar value range (if code points are thought of as
a subrange of the integers).
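One way to see that they're holes: Python 3 (a convenient way to poke at
this, not part of any Unicode spec) will refuse to put a lone surrogate
code point through an encoding form.

```python
lone = "\ud800"  # a high-surrogate code point on its own

try:
    lone.encode("utf-16-be")
    encodable = True
except UnicodeEncodeError:
    # The UTF-16 encoding form is defined only for scalar values;
    # U+D800..U+DFFF are excluded, so the codec rejects it.
    encodable = False

print(encodable)  # False
```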

In detail: The glossary defines "surrogate code point" as: "A Unicode
code point in the range U+D800..U+DFFF. Reserved _for use_ by UTF-16,
where _a pair of surrogate code units_ (a high surrogate followed by a
low surrogate) “stand in” for a supplementary code point." This
definition doesn't say much; it says the code _points_ are "for _use_
by UTF-16", but then UTF-16 uses surrogate code units, not surrogate
code points. C1 in TUS §3.2 says: "The high-surrogate and low-surrogate
code _points_ _are designated for_ surrogate code _units_ in the UTF-16
character encoding form." But the actual definitions used for UTF-16
don't seem to conceptually _derive_ "surrogate code unit" from
"surrogate code point". => ??

Still, I don't understand why people keep talking about code points. For
me conceptually (albeit not historically) everything starts with scalar
values (which are index values for certain abstract things). Scalar
values are then encoded by encoding forms (and then serialized in
encoding schemes). Why does everyone talk about the more generic "code
point" instead of "scalar value", when non-scalar-value code points
aren't used? (We're not using surrogate code point pairs; we're using
surrogate code unit pairs.) Anyway, I understand that KenW and Mark
Davis have pointed to debates on this in an earlier thread.
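The pipeline just described (scalar value → encoding form → encoding
scheme) can be sketched in Python 3, with the codec names standing in
for the form-plus-scheme combinations (big-endian, no BOM):

```python
cp = 0x1F600          # a scalar value (just an integer)
s = chr(cp)

# Encoding forms map scalar values to code unit sequences; the
# byte-serialized results below are the corresponding encoding schemes.
print(s.encode("utf-8"))      # b'\xf0\x9f\x98\x80'  (four 8-bit units)
print(s.encode("utf-16-be"))  # b'\xd8=\xde\x00'     (two 16-bit units)
print(s.encode("utf-32-be"))  # b'\x00\x01\xf6\x00'  (one 32-bit unit)
```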

Received on Sun Sep 15 2013 - 23:37:49 CDT
