Re: Encodings for SQL Databases

From: John Cowan (cowan@locke.ccil.org)
Date: Mon Aug 07 2000 - 19:38:08 EDT


On Mon, 7 Aug 2000, Michael (michka) Kaplan wrote:

> Are you saying that a value made up of twelve 16-bit values that was
> actually six surrogate pairs would be treated as:
>
> a) Six characters with unknown sort characteristics, or
>
> b) Twelve characters, at least six of which would have unknown sort
> characteristics (since the first two bytes of a surrogate pair would not
> have a defined sort order, and the second two bytes might randomly coincide
> with an existing BMP value when treated as a separate Unicode code point)?

I can't answer the question, but there is an erroneous preconception here.
Neither of the 16-bit units of a surrogate pair can coincide with any
existing BMP character: the values D800-DBFF (high) and DC00-DFFF (low)
are reserved for surrogates.
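
The point can be checked with a short sketch. It is only an illustration in
Python, not anything from the database under discussion, and the helper names
to_utf16_units and is_surrogate_unit are made up for this example. It encodes
six supplementary characters as UTF-16 and shows that all twelve resulting
16-bit units fall in the reserved surrogate range, whereas ordinary BMP text
produces none.

```python
import unicodedata

def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def is_surrogate_unit(unit):
    """True if the 16-bit unit lies in the reserved range D800-DFFF."""
    return 0xD800 <= unit <= 0xDFFF

# Six supplementary characters (U+10000 repeated) become twelve 16-bit units.
six_pairs = to_utf16_units("\U00010000" * 6)

# Ordinary BMP text never produces a unit in the surrogate range.
bmp_units = to_utf16_units("A\u00e9\u4e2d")

# Python's Unicode database classes these code points as Cs (surrogate),
# not as letters, digits, or any other assigned character category.
high_category = unicodedata.category(chr(0xD800))
low_category = unicodedata.category(chr(0xDC00))
```

A surrogate-unaware collation would therefore see twelve units that all have
the Cs category, rather than a mixture of unknown units and ordinary BMP
characters.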

> I would call (a) "surrogate aware", and (b) "surrogate safe", where "safe"
> would be defined as "at least the data did not get corrupted!". Obviously it
> is not entirely safe when you are considering collation and intrinsic string
> manipulation issues.

Every surrogate-unaware application is surrogate-safe in your limited
sense, unless it goes to the trouble of weeding out surrogates (which is
pointless). True surrogate-unsafeness appears when you allow things like
inserting characters into a string: inserting immediately after a high
surrogate separates it from its low surrogate and so corrupts the pair.
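
A minimal sketch of a guarded insertion, assuming the string is held as a
list of 16-bit code units. The names splits_pair and safe_insert are
hypothetical and exist only for this illustration; rejecting the insertion is
one possible policy, and an implementation could equally move the insertion
point past the low surrogate.

```python
def is_high_surrogate(unit):
    """True for the first unit of a surrogate pair (D800-DBFF)."""
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    """True for the second unit of a surrogate pair (DC00-DFFF)."""
    return 0xDC00 <= unit <= 0xDFFF

def splits_pair(units, index):
    """True if inserting at index would land between the halves of a pair."""
    return (0 < index < len(units)
            and is_high_surrogate(units[index - 1])
            and is_low_surrogate(units[index]))

def safe_insert(units, index, new_units):
    """Return a new unit list with new_units inserted at index.

    Raises ValueError rather than separating a high surrogate from the
    low surrogate that follows it.
    """
    if not 0 <= index <= len(units):
        raise IndexError("insertion index out of range")
    if splits_pair(units, index):
        raise ValueError("insertion point falls inside a surrogate pair")
    return units[:index] + list(new_units) + units[index:]

# 'A', the pair for U+10000, 'B'
sample = [0x0041, 0xD800, 0xDC00, 0x0042]
before_pair = safe_insert(sample, 1, [0x0058])  # insert 'X' before the pair
after_pair = safe_insert(sample, 3, [0x0058])   # insert 'X' after the pair
```

Inserting at index 2 of the sample raises ValueError, because that position
lies between the high and low halves of the pair.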

-- 
John Cowan                                   cowan@ccil.org
C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant
le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux,
de rapport nyait pas.               -- Jacques Lacan, "L'Etourdit"


