Re: Surrogate space in Unicode

From: DougEwell2@cs.com
Date: Fri Feb 16 2001 - 11:44:25 EST


In a message dated 2001-02-16 7:56:12 Pacific Standard Time,
mike.sykes@acm.org writes:

> It's clearer, but misses what I understand to be the absolutely crucial
> distinction between a code point (correctly defined) and a code unit
> (mentioned by Mark but not by Doug). For what a code unit is, see
> http://www.unicode.org/unicode/reports/tr17

I didn't mention code units because, embarrassingly, I am still having a hard
time telling the difference between code points and code units. I have read
UTR #17 many times and am still somewhat confused. I'll try again.
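
For what it is worth, here is a small sketch of the distinction as I currently read UTR #17: a code point is a number in the Unicode code space, and a code unit is the fixed-width chunk (8, 16 or 32 bits) that a particular encoding form uses to carry it. The helper names `code_units` and `utf16_units` are my own, invented for illustration, and U+10400 is just a convenient supplementary character.

```python
# Illustrative only: code_units and utf16_units are hypothetical helper
# names, not anything defined by UTR #17.

def code_units(text):
    """Count code points and the code units each encoding form needs."""
    return {
        "code points": len(text),
        "UTF-8 code units": len(text.encode("utf-8")),           # 8-bit units
        "UTF-16 code units": len(text.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32 code units": len(text.encode("utf-32-be")) // 4,  # 32-bit units
    }

def utf16_units(text):
    """Return the 16-bit code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

sample = "\U00010400"  # one code point outside the BMP
print(code_units(sample))
print([hex(u) for u in utf16_units(sample)])  # ['0xd801', '0xdc00']
```

One code point, but four UTF-8 code units, two UTF-16 code units (a surrogate pair) and one UTF-32 code unit. The count of code units depends on the encoding form; the count of code points does not.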

> I would question whether 'surrogate code points' are really code points. In
> the sense that they are a subset of 'code points' as defined, I guess they
> are; but they are not only unlike every other code point in that they "do
> not directly represent characters", they are explicitly and inexorably
> disqualified from so doing, being reserved for use, in pairs, as UTF-16
> code units. (Which is what Mark said, of course.)

I think they would still be code points, just like U+FFFE and U+FFFF (and now
other noncharacters), which are also guaranteed never to be characters, though
for a different reason.
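
A rough sketch of what I mean: every value below is a code point, but some ranges are set aside and never represent characters. The function name `classify` is made up for this example, and the noncharacter ranges are the ones I understand the current standard to list (U+FDD0..U+FDEF plus the last two code points of every plane).

```python
# Illustrative only: classify is a hypothetical helper, and the ranges
# reflect my understanding of the standard's surrogate and noncharacter sets.

def classify(cp):
    """Say which reserved group, if any, a code point belongs to."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # reserved for UTF-16 code units, in pairs
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"     # U+FFFE, U+FFFF and the others
    return "other"

for cp in (0x0041, 0xD800, 0xDFFF, 0xFFFE, 0xFFFF, 0x10FFFF):
    print(f"U+{cp:04X}: {classify(cp)}")
```

Both reserved groups pass the range check, which is the sense in which they are still code points; they differ only in why they can never be characters.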

> Looked at in this way, surely it makes it clearer that the transcoding of a
> surrogate (code point) into UTF-8 is an abomination.
>
> Simplification is all very well, but it can be taken too far, as when
> important distinctions are lost.

Yes, that is true. I might have known better than to respond to a "cut the
mumbo-jumbo" post. Einstein said, "Everything should be made as simple as
possible, but not one bit simpler," and I think that is especially true when
working with standards and specifications, where precise and unambiguous
wording is crucial.
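
Coming back to the UTF-8 point quoted above, here is a quick check using Python's built-in UTF-8 codec as one example of a strict implementation; other libraries may behave differently. It refuses to encode a lone surrogate code point, refuses to decode the three bytes a naive encoder would have produced for it, and encodes the supplementary character itself as four bytes.

```python
# Illustrative only: shows how one strict codec (Python's) treats surrogates.

lone_surrogate = "\ud800"  # a surrogate code point on its own

try:
    lone_surrogate.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False

try:
    b"\xed\xa0\x80".decode("utf-8")  # naive 3-byte form of U+D800
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(encoded_ok, decoded_ok)                 # False False
print("\U00010400".encode("utf-8").hex())     # f0909080
```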

-Doug Ewell
 Fullerton, California


