Re: How does Python Unicode treat surrogates?

From: Rick McGowan (rick@unicode.org)
Date: Mon Jun 25 2001 - 12:41:54 EDT

Next message: J M Sykes: "Re: How does Python Unicode treat surrogates?"
Previous message: Magda Danish (Unicode): "FW: Arabic font Crisis!!!!!"
Maybe in reply to: Gaute B Strokkenes: "Re: How does Python Unicode treat surrogates?"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: How does Python Unicode treat surrogates?"

Gaute B Strokkenes wrote...

> [I'm cc:-ing the unicode list to make sure that I've gotten my
> terminology right, and to solicit comments

Interesting... I just started looking at Python the other day, once I
discovered it has such nice built-in Unicode support.

If Python is explicitly storing the stuff as UTF-16 in u"" strings, then
slicing operations certainly should be acting on units of the backing
store, just as for ASCII "character" strings. In that case, in order for
every unit to be addressable, it should allow breaking up of surrogate
pairs. (Apple's Cocoa environment strings work the same way with
"ranges".) There should be another operation, or several, that slice up
strings based on other kinds of text element boundaries. For example, a
"slice on character boundaries" that would always shift the range to
accommodate surrogate pairs -- as a separate operation.
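
To make the distinction concrete, here is a rough sketch of the two
operations. It is only an illustration under stated assumptions: it models
the UTF-16 backing store as a plain list of integer code units (current
Python 3 strings are indexed by code point, so the list stands in for the
"array"), and the names to_units, slice_units and slice_on_characters are
hypothetical, not part of any Python API.

```python
import struct


def to_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    # "surrogatepass" lets unpaired surrogates through so the model can hold them.
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high(unit):
    """True for a high (leading) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    """True for a low (trailing) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def slice_units(units, start, end):
    """Low-level slice: knows only about units, so it may split a pair."""
    return units[start:end]


def slice_on_characters(units, start, end):
    """Separate operation: widen the range so no surrogate pair is split."""
    n = len(units)
    start = max(0, min(start, n))
    end = max(0, min(end, n))
    if start >= end:
        return []  # an empty range stays empty
    # If the range starts on the low half of a pair, pull the start back one unit.
    if start > 0 and is_low(units[start]) and is_high(units[start - 1]):
        start -= 1
    # If the range ends between the two halves of a pair, push the end out one unit.
    if end < n and is_low(units[end]) and is_high(units[end - 1]):
        end += 1
    return units[start:end]


# "a", U+10000 (one surrogate pair), "b"
units = to_units("a\U00010000b")
print([hex(u) for u in slice_units(units, 0, 2)])          # splits the pair
print([hex(u) for u in slice_on_characters(units, 0, 2)])  # keeps the pair whole
```

The point of keeping these as two functions is that the first never looks
at the values it is slicing, while only the second has to know anything
about surrogates.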

The low-level routines in Python, like slicing with absolute locations,
shouldn't presume to know about the encoding, only about the UNITS that are
in the "array".

In my opinion,
Rick

Begin forwarded message:

> This is completely and totally wrong. The Unicode standard version
> 3.1 states (conformance requirement C12(c)): A conformant process shall
> not interpret illegal UTF code unit sequences as characters.
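
As a small illustration of the requirement quoted above, the sketch below
shows a strict decoder refusing to turn an unpaired surrogate into a
character. It assumes the utf-16-le codec of a current Python 3
interpreter with the default "strict" error handling, which may differ
from the Python release discussed in this thread; the helper name
decodes_cleanly is hypothetical.

```python
def decodes_cleanly(raw):
    """Return True if raw is accepted by a strict UTF-16 (little-endian) decode."""
    try:
        raw.decode("utf-16-le")  # default errors="strict"
        return True
    except UnicodeDecodeError:
        return False


well_formed = b"\x00\xd8\x00\xdc"   # 0xD800 0xDC00, the pair for U+10000
lone_high = b"\x00\xd8"             # 0xD800 with no trailing surrogate
lone_low = b"\x00\xdc"              # 0xDC00 with no leading surrogate
high_then_a = b"\x00\xd8\x61\x00"   # 0xD800 followed by "a"

for name, raw in [("well_formed", well_formed), ("lone_high", lone_high),
                  ("lone_low", lone_low), ("high_then_a", high_then_a)]:
    print(name, decodes_cleanly(raw))
```

This is a separate question from slicing: a unit-level slice can produce
such a sequence, and it is the process that later interprets the units as
characters that has to reject it.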


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT