Re: Corrigendum #9

From: Philippe Verdy <>
Date: Tue, 3 Jun 2014 00:06:21 +0200

We can still draw a line: interchange should be understood to mean that
other, non-Unicode standards must find a way not to mix random data into
plain text without defining a clear encapsulation and escaping mechanism
that ensures the plain text remains isolatable.
In other words, design separate layers of representation and processing,
and be more imaginative when you design an application or protocol, with
better modeling.
If an application really needs some non-characters internally, this is not
really for encoding text but for the application- or protocol-specific
system of encapsulation, which should be clearly identified:
- these protocols can use separate APIs for handling objects that are
composite and contain some text but that are not text by themselves.
- they should isolate data types (or MIME types)
- they should use some "magic" identifiers in the headers of their data,
including versioning in their protocol
- they should document their own encapsulation/escaping internally
- they should test it to make sure it preserves the valid text content
without breaking it
As this kind of data is not text, we fall within the design of binary data.
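
To make the list above concrete, here is a minimal sketch of such an
encapsulation, assuming a made-up binary envelope: the magic identifier
"XENV", the one-byte version, and the 16-bit sentinel field are all
hypothetical names chosen for illustration, not part of any real protocol.
The point is only that the application's sentinel value travels in a
header field, outside the text, and the text payload stays valid UTF-8.

```python
import struct
from typing import Tuple

MAGIC = b"XENV"      # hypothetical "magic" identifier for this envelope
VERSION = 1          # hypothetical protocol version
HEADER = ">BHI"      # version (1 byte), sentinel (2 bytes), text length (4 bytes)


def pack(text: str, sentinel: int) -> bytes:
    """Wrap plain text and an application sentinel in a binary envelope."""
    payload = text.encode("utf-8")  # the text layer is kept as plain UTF-8
    return MAGIC + struct.pack(HEADER, VERSION, sentinel, len(payload)) + payload


def unpack(blob: bytes) -> Tuple[str, int]:
    """Recover the text and the sentinel, rejecting anything unrecognized."""
    if blob[:len(MAGIC)] != MAGIC:
        raise ValueError("not an XENV envelope")
    version, sentinel, length = struct.unpack_from(HEADER, blob, len(MAGIC))
    if version != VERSION:
        raise ValueError("unsupported envelope version")
    start = len(MAGIC) + struct.calcsize(HEADER)
    payload = blob[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated envelope")
    return payload.decode("utf-8"), sentinel
```

A round trip such as unpack(pack(text, sentinel)) is also the kind of test
the last bullet asks for: it checks that the valid text content comes back
unchanged.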

These kinds of statements mean that protocols and APIs will be improved for
better separation of layers, working more as separate black boxes. But it is
not up to the Unicode standard to explain how they will do it.

So for me non-characters are not Unicode text, they are not text at all, and
we should not attempt to make them legal if we want to allow strong designs
of isolation mechanisms that permit this separation of layers. The Unicode
standard offers enough space for this separation: non-characters
(invalid in all standard UTFs) and invalid code sequences in standard
UTFs both allow building specific encodings, which must not be called
"UTFs" (or "Unicode" or "UCS" or other terms defined in TUS) or identified
as such in API/protocol designs.
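
As an illustration of enforcing that separation at a boundary, the sketch
below flags non-characters in a string before it is handed to another
layer. The function names are hypothetical; the code point set is the 66
non-characters defined in TUS (U+FDD0..U+FDEF plus the last two code
points of each of the 17 planes).

```python
from typing import List, Tuple


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> List[Tuple[int, int]]:
    """Return (index, code point) pairs for each non-character in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_noncharacter(ord(ch))]
```

What a layer does with the result (reject, escape, or strip) is a design
decision of that protocol, not something this sketch decides.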

Things would simply be better if TUS did not even define what a
non-character is, and if it did not even suggest that they are legal in
"some" circumstances of text "interchange".

2014-06-02 18:08 GMT+02:00 Mark Davis ☕️ <>:

> The problem is where to draw the line. In today's world, what's an app?
> You may have a cooperating system of "apps", where it is perfectly
> reasonable to interchange sentinel values (for example).
> I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
> where we should make it clearer.)
> Mark <>
> *— Il meglio è l’inimico del bene —*
> On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele <>
> wrote:
>> I also think that the verbiage swung too far the other way. Sure, I
>> might need to save or transmit a file to talk to myself later, but apps
>> should be strongly discouraged from using these for interchange with other
>> apps.
>> Interchange bugs are why nearly any news web site ends up with at least a
>> few articles with mangled apostrophes or whatever (because of encoding
>> differences). Should authors’ tools or feeds or databases or whatever
>> start emitting non-characters from internal use, then we’re going to have
>> ugly leak into text “everywhere”.
>> So I’d prefer to see text that better permitted interchange with other
>> components of an application’s internal system or partner system, yet
>> discouraged use for interchange with “foreign” apps.
>> -Shawn
>> _______________________________________________
>> Unicode mailing list

Received on Mon Jun 02 2014 - 17:08:26 CDT

This archive was generated by hypermail 2.2.0 : Mon Jun 02 2014 - 17:08:26 CDT