Re: Counting Codepoints

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Tue, 13 Oct 2015 19:53:29 +0100

On Tue, 13 Oct 2015 15:23:36 +0000
David Starner <prosfilaes_at_gmail.com> wrote:

> A UTF-16 string could delete one surrogate, or add a fractional
> character. A Unicode string (not a "UTF-16 string"), which could be
> stored internally in, say, a Python-like format which is Latin-1,
> UCS-2, or UTF-32, conversions made as needed and differences hidden
> from the user, can't.

Confusingly, the Unicode definitions are the other way round. A
UTF-16 string is a string of UTF-16 code units in which all surrogate
code units are paired surrogates. Any string of UTF-16 code units
is a Unicode 16-bit string.
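
To illustrate the distinction, here is a minimal sketch in Python,
assuming the code units arrive as a plain list of integers. The
function name is_well_formed_utf16 is made up for this example and is
not taken from the standard or from any library. It returns True only
when every surrogate code unit is part of a high/low pair; any list of
16-bit values, paired or not, would still count as a Unicode 16-bit
string.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit in `units` is paired.

    `units` is assumed to be a sequence of integers in 0..0xFFFF.
    The name and interface are hypothetical, for illustration only.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed directly by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True
```

For example, [0x0041, 0xD83D, 0xDE00] passes the check, while [0xD83D]
alone does not, although both are sequences of 16-bit code units.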

Richard.
Received on Tue Oct 13 2015 - 13:54:41 CDT
