Re: Corrigendum #9

From: Philippe Verdy <>
Date: Tue, 3 Jun 2014 00:55:31 +0200

"Reserved for CLDR" would be wrong in TUS: you have reached a boundary
where you are no longer handling plain text (a stream of scalar values
assigned to code points) but binary data, via a binary interface outside
TUS (handling streams of collation elements, whose representation is not
even bound to the ICU implementation of CLDR for its own definitions and
syntax for its tailorings).

CLDR data defines its own interface and protocol. It can reserve these code
points for itself, but not in TUS, and no other conforming plain-text
application is expected to accept these reservations. Such applications can
**freely** mark them as errors, replace them, filter them out, or
interpret them differently for their own usage, using their own
specification and encapsulation mechanisms and specific **non-plain-text**
data types.
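
To illustrate the options listed above (mark in error, replace, filter
out), here is a minimal sketch in Python. The function names and the
"policy" parameter are my own, purely illustrative, and come from neither
TUS nor ICU. The only fact relied on is the set of 66 noncharacters:
U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n (0..16).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def handle_noncharacters(text: str, policy: str = "replace") -> str:
    """Apply one of three illustrative policies to noncharacters in text.

    policy is a hypothetical parameter: "error", "filter" or "replace".
    """
    if policy == "error":
        for i, ch in enumerate(text):
            if is_noncharacter(ord(ch)):
                raise ValueError(f"noncharacter U+{ord(ch):04X} at index {i}")
        return text
    if policy == "filter":
        # Drop noncharacters silently.
        return "".join(ch for ch in text if not is_noncharacter(ord(ch)))
    if policy == "replace":
        # Substitute U+FFFD REPLACEMENT CHARACTER.
        return "".join(
            "\ufffd" if is_noncharacter(ord(ch)) else ch for ch in text
        )
    raise ValueError(f"unknown policy: {policy!r}")
```

Which policy an application picks is its own business; the point is only
that none of them depends on anything CLDR reserves.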

CLDR data transmitted in binary form that embeds these code points is
not transporting plain text; it is still a binary datatype specific to
this application. CLDR data must remain isolated in its scope without
forcing other protocols or TUS to follow its practices.

Other applications may develop "gateway" interfaces to convert them to be
interoperable with ICU, but they are not required to do that. If they do,
they will follow the ICU specifications, not TUS, and this should not
influence their own way of handling what TUS describes as plain text.

To make it clear, it is preferable to just say in TUS that the behavior of
applications with non-characters is completely undefined and unpredictable
without an external specification, and these entities should not even be
considered as encodable in any standard UTF (any of which can freely be
replaced by another one without causing any loss or modification of the
represented plain text). It should be possible to define other
(non-standard) conforming UTFs which are completely unable to represent these
non-characters (as well as any unpaired surrogate). A conforming UTF just
needs to be able to represent streams of scalar values in their full
standard range (even without knowing whether they are assigned and without
knowing their character properties).
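
As a small sketch of what "streams of scalar values in their full standard
range" means in practice, the Python below checks whether a code point is
a scalar value and whether a string survives a round trip through several
UTFs. The helper names are mine. Note that Python's built-in codecs, like
the standard UTFs as currently defined, do encode non-characters while
rejecting unpaired surrogates; the example shows that current behavior,
not the stricter UTFs suggested above.

```python
def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+10FFFF excluding the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def roundtrips(text: str, encodings=("utf-8", "utf-16-le", "utf-32-le")) -> bool:
    """True if text encodes and decodes unchanged in every listed UTF."""
    try:
        return all(text.encode(enc).decode(enc) == text for enc in encodings)
    except UnicodeEncodeError:
        # Raised by Python's codecs for unpaired surrogates.
        return False


# Non-characters are scalar values, so the built-in codecs carry them.
print(roundtrips("a\uffff\U0010fffe"))
# An unpaired surrogate is not a scalar value and is rejected.
print(roundtrips("\ud800"))
```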

You can/should even design CLDR to completely avoid the use of
non-characters: it's up to it to define an encapsulation/escaping mechanism
that clearly separates what is standard plain text in the content from what
is not and is used for specific purposes in CLDR or ICU implementations.
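
One possible shape for such an encapsulation, as a rough sketch only: keep
special markers out of band as typed tokens and serialize them with an
escape syntax made of ordinary characters. Everything here is hypothetical
(the Text/Special types, the "\{name}" escape, the name "max-weight") and
is not CLDR or ICU syntax. Marker names containing "}" are not supported
by this toy parser.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Text:
    value: str  # ordinary plain text


@dataclass(frozen=True)
class Special:
    name: str  # hypothetical marker name, e.g. "max-weight"


Token = Union[Text, Special]


def serialize(tokens: List[Token]) -> str:
    """Write tokens as plain text; specials become \\{name} escapes."""
    out = []
    for tok in tokens:
        if isinstance(tok, Special):
            out.append("\\{" + tok.name + "}")
        else:
            # Double literal backslashes so they cannot start an escape.
            out.append(tok.value.replace("\\", "\\\\"))
    return "".join(out)


def parse(s: str) -> List[Token]:
    """Inverse of serialize (adjacent Text tokens come back merged)."""
    tokens: List[Token] = []
    buf: List[str] = []
    i = 0
    while i < len(s):
        if s[i] != "\\":
            buf.append(s[i])
            i += 1
        elif s.startswith("\\\\", i):
            buf.append("\\")
            i += 2
        elif s.startswith("\\{", i):
            end = s.find("}", i + 2)
            if end == -1:
                raise ValueError("unterminated escape")
            if buf:
                tokens.append(Text("".join(buf)))
                buf = []
            tokens.append(Special(s[i + 2:end]))
            i = end + 1
        else:
            raise ValueError(f"invalid escape at index {i}")
    if buf:
        tokens.append(Text("".join(buf)))
    return tokens
```

The serialized form contains only ordinary characters, so it can pass
through any plain-text channel without relying on non-characters.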

2014-06-03 0:07 GMT+02:00 Shawn Steele <>:

> Except that, particularly the max-weight ones, mean that developers can
> be expected to use this as sentinels in code using ICU, which would
> preclude their use for other things?
> Which makes them more like “reserved for use in CLDR” than “noncharacters”?
> -Shawn
> *From:* Unicode [] *On Behalf Of *Markus
> Scherer
> *Sent:* Monday, June 2, 2014 2:53 PM
> *To:* David Starner
> *Cc:* Unicode Mailing List
> *Subject:* Re: Corrigendum #9
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner <>
> wrote:
> I would especially discourage any web browser from handling
> these; they're noncharacters used for unknown purposes that are
> undisplayable and if used carelessly for their stated purpose, can
> probably trigger serious bugs in some lamebrained utility.
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".
> markus
> _______________________________________________
> Unicode mailing list

Received on Mon Jun 02 2014 - 17:57:19 CDT
