Re: How does Python Unicode treat surrogates?

From: Rick McGowan (rick@unicode.org)
Date: Mon Jun 25 2001 - 12:41:54 EDT


Gaute B Strokkenes wrote...

> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments

Interesting... I just started looking at Python the other day, once I
discovered it has such nice built-in Unicode support.

If Python is explicitly storing the stuff as UTF-16 in u"" strings, then
slicing operations certainly should be acting on units of the backing
store, just as for ASCII "character" strings. In that case, in order for
every unit to be addressable, it should allow breaking up of surrogate
pairs. (Apple's Cocoa environment strings work the same way with
"ranges".) There should be another operation, or several, that slice up
strings based on other kinds of text element boundaries. For example, a
"slice on character boundaries" that would always shift the range to
accommodate surrogate pairs -- as a separate operation.
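
A rough sketch of the two operations, to make the distinction concrete.
This is not how Python implements anything internally: it models the
backing store as a plain list of 16-bit code units, and the names
utf16_units, slice_units and slice_on_char_boundaries are made up for
illustration.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def slice_units(units, start, stop):
    """Low-level slice on code units; may split a surrogate pair."""
    return units[start:stop]

def slice_on_char_boundaries(units, start, stop):
    """Separate operation: widen the range so no pair is split."""
    n = len(units)
    start = max(0, min(start, n))
    stop = max(start, min(stop, n))
    if start == stop:
        return []
    # start points at the low half of a pair: back up to the high half.
    if 0 < start < n and is_low(units[start]) and is_high(units[start - 1]):
        start -= 1
    # stop falls between the two halves of a pair: include the low half.
    if 0 < stop < n and is_low(units[stop]) and is_high(units[stop - 1]):
        stop += 1
    return units[start:stop]

units = utf16_units("a\U00010000b")          # one pair in the middle
split = slice_units(units, 0, 2)             # ends on a lone high surrogate
whole = slice_on_char_boundaries(units, 0, 2)
```

Here split keeps exactly the two units asked for, while whole is widened
to three units so that the pair stays together.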

The low-level routines in Python, like slicing with absolute locations,
shouldn't presume to know about the encoding, only about the UNITS that are
in the "array".
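
One way to picture that layering, again only as a sketch with a
hypothetical helper name (units_to_text): the array slice itself never
looks at the encoding, and the step that interprets units as characters
is a separate one that can refuse an unpaired surrogate. The strict
"utf-16-be" decoder from the standard library is used here as the
interpreting step.

```python
import struct

def units_to_text(units):
    """Interpret UTF-16 code units as characters, rejecting bad input."""
    data = struct.pack(">%dH" % len(units), *units)
    # The strict decoder raises UnicodeDecodeError on unpaired surrogates.
    return data.decode("utf-16-be")

units = [0x0061, 0xD800, 0xDC00, 0x0062]   # "a", U+10000 as a pair, "b"

raw = units[0:2]          # plain array slice: fine as a unit operation
try:
    units_to_text(raw)
    split_ok = True
except UnicodeDecodeError:
    split_ok = False      # the lone high surrogate is not a character

whole = units_to_text(units)
```

The slice succeeds either way; only the attempt to read the result as
characters fails.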

In my opinion,
        Rick

Begin forwarded message:

> This is completely and totally wrong. The Unicode standard version
> 3.1 states (conformance requirement C12(c)): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.


