From: Doug Ewell (dewell@roadrunner.com)
Date: Fri Jul 04 2008 - 16:15:49 CDT
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> Are you just considering the definitions, but also ignoring the
> conformance clauses that restrict them more precisely?
Jeroen asked a simple question, and I answered it:
"When you have the U+D800 - U+DFFF range for creating code points using
surrogate pairs and you take for example U+20045 it will be created as:
U+D840 U+DC45. Are these, by themselves only code units or are they also
code points?"
There is a lot of complexity to Unicode -- otherwise the book wouldn't
be 570 pages long BEFORE you get to the code charts -- but my answer was
not wrong. Yes, the values 0xD840 and 0xDC45 are code points. They are
surrogate code points, and they are only used in UTF-16, and they are
not Unicode scalar values, and they do not individually encode
characters, but that was not what Jeroen asked.
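For anyone who wants to check the arithmetic in Jeroen's example, here is a minimal sketch of the UTF-16 surrogate computation. The function name `to_surrogate_pair` is my own, chosen for illustration; it is not taken from the standard or from any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Return the (high, low) UTF-16 code units for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x20045)
print(f"{high:04X} {low:04X}")  # D840 DC45
```

The two results are 16-bit code units in the UTF-16 encoding form. Their numeric values fall in the range D800..DFFF, which is also the range of the surrogate code points.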
> Really, I prefer NEVER using the U+xxxx notation for anything else
> that is not mapped to a single code point, independently of the
> encoding form or encoding scheme where those code points may be mapped
> to ordered streams of code units or bytes.
You are correct about the notation. U+... notation is only for use with
code points, not code units. I did not perceive the notation as being
at the heart of Jeroen's question, and in private exchange, he confirmed
that it was not. But you are correct about the notation.
> And I don't make the confusion between code points and code units
> because they don't belong to the same space (even if they seem to
> intersect, they don't: code points are arbitrary elements without
> numeric capabilities, so without arithmetic, even if they are assigned
> several numeric properties like their nominal scalar value; code units
> have arithmetic properties, they are elements in a mathematical Galois
> field).
Well, gee, I don't like to "make the confusion" either, which is
probably why I opened the book before answering, instead of trusting my
instinct. Actually, my instinct was wrong on this: I was expecting to
see that the surrogates were not code points. In fact, they are not
Unicode scalar values. That is why that term was invented: (code
points) - (surrogate code points) = (USVs).
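The subtraction can be written out as a small sketch. The predicate names below are hypothetical, picked only to mirror the terms used above; the ranges are the ones quoted in this thread.

```python
def is_code_point(value: int) -> bool:
    # The codespace runs from 0 to 10FFFF inclusive.
    return 0x0000 <= value <= 0x10FFFF


def is_surrogate_code_point(value: int) -> bool:
    # D800..DFFF, the range reserved for UTF-16 surrogates.
    return 0xD800 <= value <= 0xDFFF


def is_unicode_scalar_value(value: int) -> bool:
    # (code points) - (surrogate code points) = (USVs)
    return is_code_point(value) and not is_surrogate_code_point(value)


for value in (0xD840, 0xDC45, 0x20045):
    print(f"{value:04X}",
          is_code_point(value),
          is_surrogate_code_point(value),
          is_unicode_scalar_value(value))
```

Under these definitions D840 and DC45 are code points but not scalar values, while 20045 is both.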
> UTF-16 defines an encoding form/scheme for conforming texts, not just
> for isolated characters.
Jeroen didn't ask about encoding U+D840 in isolation, or U+DC45 in
isolation.
> TUS is clear:
> "Each encoding form maps the Unicode code points U+0000..U+D7FF and
> U+E000..U+10FFFF to unique code unit sequences."
>
> This means that there's NO ***code points*** of the surrogates range
> U+D800..U+DFFF in any encoding form, so they can't occur as well in
> UTF-16 (as long as you are conforming to its rules).
But they do exist as ***code points***. TUS is clear there too, in
definitions D9 and D10.
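One implementation happens to show both halves of this at once. The sketch below assumes Python 3.4 or later; it demonstrates how that one language behaves and is not an authority on what TUS says. The helper `can_encode` is a hypothetical name of my own.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether the codec accepts the string without raising."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


lone = chr(0xD840)              # accepted: a string may hold a surrogate code point
supplementary = "\U00020045"    # an ordinary supplementary character

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, can_encode(lone, encoding), can_encode(supplementary, encoding))
```

Here the surrogate exists as a code point with its own value, yet none of the three codecs will map it to code units, which matches the sentence Philippe quoted about encoding forms.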
I'd like to wait for Ken or Mark or somebody to issue a bull on this. I
think I gave the correct answer to the question Jeroen asked, and you
are giving the correct answer for the question you think Jeroen really
meant to ask.
--
Doug Ewell * Arvada, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages