From: Peter Constable (petercon@microsoft.com)
Date: Thu May 19 2005 - 15:10:12 CDT
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
> On Behalf Of Dean Snyder
> Surely you are not denying that surrogates, BOM and annotation
> characters are stateful mechanisms?
Surrogates may be a stateful mechanism, but they are not a stateful
*character* mechanism. Annotation characters may be stateful, but they
are intended for use only within software processes, where state is not
an issue.
Sure, if someone sends me a file with a sequence < a, b, FFF9, c, d,
FFFA, e, f, FFFB >, I could cut and paste < d, FFFA, e > into some other
location, completely messing up the annotation syntax; but they
shouldn't be creating such content in the first place.
Sure, if an app that uses UTF-16 representation internally displays a
surrogate-pair sequence as a pair of boxes I could select a run
beginning or ending in the middle of such a pair and then make some
change that would produce garbage; but I don't expect to successfully
work on supplementary-plane text in an app that doesn't actually support
supplementary-plane text.
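
As a rough illustration of that second case, here is a minimal Python sketch of what happens when a selection ends in the middle of a surrogate pair. The helper `cut_code_units` is hypothetical (it is not from any particular app); it simply slices by UTF-16 code-unit offsets, the way a surrogate-unaware editor might.

```python
def cut_code_units(text, start, end):
    """Slice text by UTF-16 code-unit offsets (hypothetical helper).

    Mimics an editor that treats every 16-bit code unit as a separate
    selectable item and knows nothing about surrogate pairs.
    """
    data = text.encode("utf-16-be")   # 2 bytes per code unit, no BOM
    return data[2 * start:2 * end]

# "a", U+10302, "b" is four UTF-16 code units: 0061 D800 DF02 0062
text = "a\U00010302b"

# Select the first two code units: the run ends between D800 and DF02.
fragment = cut_code_units(text, 0, 2)

# A strict decoder rejects the fragment, because D800 has lost its partner.
try:
    fragment.decode("utf-16-be")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# A lenient decoder substitutes U+FFFD for the orphaned surrogate.
repaired = fragment.decode("utf-16-be", errors="replace")

print(fragment.hex(" "), decodes_cleanly, repr(repaired))
```

The garbage comes from cutting at the code-unit level; no cut made at character boundaries can split the pair.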
> And for that matter, I don't understand why you left out the bidi
> operators here, which I also mentioned. Do you consider them part of the
> text content?
Yes; that is, they get processed at the same level of representation as
(say) "a"; they do not get processed at the same level of
representation as (say) the BOM or surrogate code units.
> SURROGATES:
>
> The Unicode Standard 4.1, section 3.9
> "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
> represented as
> <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."
>
> How can you say that, for example, the surrogates in this very example
> in TUS are not used in text content?
By "text content" I meant the character content -- i.e. what is
recognized at the level of character interpretation. (IIUC, analogous, I
guess, to the notion of "infoset" used in relation to XML and SGML.)
D800 and DF02 are not characters; they are code units used in the UTF-16
encoding form. They may be part of a stream, but they are not
individually part of the character-information content of that stream.
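
To make the two levels concrete, here is a small Python sketch using the sequence from the TUS 3.9 example quoted above. The function name `utf16_code_units` is just an illustrative choice; the only library calls are the standard `str.encode` and `struct.iter_unpack`.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")          # big-endian, no BOM
    return [unit for (unit,) in struct.iter_unpack(">H", data)]

# The code point sequence <004D, 0430, 4E8C, 10302>
text = "\u004D\u0430\u4E8C\U00010302"

# Character level: four code points, one of them supplementary.
code_points = [ord(ch) for ch in text]

# Encoding-form level: five 16-bit code units, the last two a surrogate pair.
code_units = utf16_code_units(text)

print(" ".join(f"{cp:04X}" for cp in code_points))  # 004D 0430 4E8C 10302
print(" ".join(f"{cu:04X}" for cu in code_units))   # 004D 0430 4E8C D800 DF02
```

D800 and DF02 show up only in the second list; at the character level there is just U+10302.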
Peter Constable