Re: Counting Codepoints

From: David Starner
Date: Mon, 12 Oct 2015 23:35:32 +0000

Any system that exposes Unicode strings (as opposed to UTF-16 strings)
cannot have two surrogates merge when two strings are appended. Nothing in
the Unicode Standard says that should happen for a string in an arbitrary
format, and it's unreasonable behavior for a string. Thus a Unicode string
simply can't be stored internally as UTF-16 with unpaired surrogates; a
Unicode string in a programmer-opaque format must do something with broken
data on input.
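
To make the distinction concrete, here is a rough sketch in Python, whose
str type happens to be a sequence of code points rather than UTF-16 code
units. The helper names via_utf16 and scrub are made up for illustration,
and replacing with U+FFFD is only one possible policy for broken input,
not something the standard mandates.

```python
HIGH = "\ud83d"  # unpaired high (lead) surrogate
LOW = "\ude00"   # unpaired low (trail) surrogate


def via_utf16(s: str) -> str:
    """Round-trip through UTF-16 code units, letting surrogates through.

    Hypothetical helper: it mimics what a UTF-16-backed string would see.
    """
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def scrub(s: str) -> str:
    """Replace every surrogate code point with U+FFFD (one possible policy)."""
    return "".join("\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s)


joined = HIGH + LOW
print(len(joined))             # code point view: still two items after append
print(len(via_utf16(joined)))  # UTF-16 view: the two surrogates have paired up
print(scrub(HIGH) + scrub(LOW) == "\ufffd\ufffd")  # scrubbed on input: no merge
```

Scrubbing each string as it comes in is what keeps the append from ever
producing a new character out of two pieces of broken data.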

On Mon, Oct 12, 2015 at 1:27pm, Richard Wordingham wrote:

> On Mon, 12 Oct 2015 17:29:13 +0200
> Philippe Verdy wrote:
> > But between two implementations
> > the result of the scanner could still be different because the
> > replacement character is not specified. If the resulting "sanitized"
> > string is then used to generate a URI, the URI is also unpredictable
> > and will vary between implementations, as well as its effective
> > length. If it is used to generate an identifier granting some new
> > access, such as a user name, several new user names could be
> > generated from the same input.
> TUS 8.0 Section 3 Requirement C10 has the following, wise words in its
> final paragraph:
> "However, such repair of mangled data is a special case, and it must
> not be used in circumstances where it would cause security problems."
> Richard.
Received on Mon Oct 12 2015 - 18:36:42 CDT