On Mon, 7 Aug 2000, Michael (michka) Kaplan wrote:
> Are you saying that a value made up of twelve 16-bit values that was
> actually six surrogate pairs would be treated as:
>
> a) Six characters with unknown sort characteristics, or
>
> b) Twelve characters, at least six of which would have unknown sort
> characteristics (since the first two bytes of a surrogate pair would not
> have a defined sort order, and the second two bytes might randomly coincide
> with an existing BMP value when treated as a separate Unicode code point)?
I can't answer the question, but there is an erroneous preconception here.
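Neither of the 16-bit units of a surrogate pair can coincide with any
existing BMP character: the first unit always lies in D800..DBFF and the
second in DC00..DFFF, and both ranges are reserved for surrogates alone.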
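
To make that concrete, here is a minimal sketch in Python that treats a
string as a list of 16-bit integers. The function names are hypothetical,
chosen only for illustration. It shows that both units of any surrogate
pair land in D800..DFFF, and how counting pairs rather than units gives
six characters instead of twelve.

```python
# Illustrative sketch only: UTF-16 text is modelled as a list of ints.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # first unit of a pair
LOW_START, LOW_END = 0xDC00, 0xDFFF     # second unit of a pair


def is_high_surrogate(unit):
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit):
    return LOW_START <= unit <= LOW_END


def encode_surrogate_pair(code_point):
    """Split a code point above the BMP into its two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def count_code_points(units):
    """Count characters, treating each well-formed pair as one."""
    count = 0
    i = 0
    while i < len(units):
        if (is_high_surrogate(units[i]) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2          # a pair is one character
        else:
            i += 1          # a BMP unit (or a lone surrogate)
        count += 1
    return count


# Six supplementary characters occupy twelve 16-bit units.
six_pairs = []
for cp in range(0x10000, 0x10006):
    six_pairs.extend(encode_surrogate_pair(cp))
```

Here len(six_pairs) is 12 while count_code_points(six_pairs) is 6, which
is the difference between cases (b) and (a) in the question.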
> I would call (a) "surrogate aware", and (b) "surrogate safe", where "safe"
> would be defined as "at least the data did not get corrupted!". Obviously it
> is not entirely safe when you are considering collation and intrinsic string
> manipulation issues.
Every surrogate-unaware application is surrogate-safe in your limited
sense, unless it goes to the trouble of weeding out surrogates (which is
pointless). True surrogate-unsafeness appears when you allow things like
inserting characters into a string: inserting immediately after a high
surrogate separates it from its low surrogate and breaks the pair.
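
As an illustration of the insertion hazard, the following Python sketch
refuses to split a pair. It again models UTF-16 text as a list of 16-bit
integers; splits_pair and safe_insert are hypothetical names, and moving
the insertion point past the low surrogate is just one possible policy
(raising an error would be another).

```python
# Illustrative sketch only: UTF-16 text is modelled as a list of ints.
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def splits_pair(units, index):
    """True if inserting at `index` would land between the two halves
    of a well-formed surrogate pair."""
    return (0 < index < len(units)
            and is_high_surrogate(units[index - 1])
            and is_low_surrogate(units[index]))


def safe_insert(units, index, new_units):
    """Insert `new_units` at `index`, stepping past the low surrogate
    if the requested position would break a pair."""
    if not 0 <= index <= len(units):
        raise IndexError("insertion point out of range")
    if splits_pair(units, index):
        index += 1  # assumed policy: insert after the whole pair
    return units[:index] + list(new_units) + units[index:]


# 'A', one surrogate pair, 'B'
text = [0x0041, 0xD800, 0xDC00, 0x0042]
result = safe_insert(text, 2, [0x0058])  # position 2 is inside the pair
```

With this policy, result keeps D800 DC00 adjacent and places the new
unit after them; a surrogate-unaware insert at position 2 would leave
two unpaired surrogates instead.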
--
John Cowan                                   cowan@ccil.org
C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant
le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux,
de rapport nyait pas. -- Jacques Lacan, "L'Etourdit"