Re: Regulating PUA.

From: Philippe Verdy (
Date: Mon Jan 22 2007 - 03:14:02 CST

    From: "Ruszlan Gaszanov"
    > The Unicode standard currently allows PUA (private use areas) for whatever anyone might want to use them for. This tends to create problems, since sometimes PUA code points are used for process-internal purposes and other times for storing non-standardized character data. Because there is no way for users who need to represent non-standardized character data to know which PUA code points some application might use for process-internal purposes, and there is no way for an application designer to know which PUA code points someone might choose for encoding non-standard characters, this can create various issues.

    This only causes issues because you have forgotten one important requirement for their use: the existence of an explicit private agreement; if you don't manage to preserve this agreement by encoding it somewhere in your data model, then you are exposed to such issues.

    In other words: don't interchange any PUAs without explicitly specifying their meaning in some external metadata! PUAs were not made for interchange on their own. They are allowed in the standard UTFs only so that one can build a transmission format within which they are referenced and transmitted *along with* the explicit private agreement.

    This means that if you send plain-text documents, which have no place to put such metadata internally, these plain-text documents should not contain *any* PUA, or the plain text will be unusable as long as the private agreement is not specified separately (for example as part of an HTTP header, or as part of the specification of a transport protocol that makes the meaning explicit).
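As a rough sketch of what such a check could look like before sending plain text (the function names `is_pua` and `find_pua` are hypothetical, chosen only for illustration), one can scan for code points in the three private use ranges: U+E000..U+F8FF in the BMP, plus planes 15 and 16.

```python
# The three private use ranges defined by Unicode.
PUA_RANGES = (
    (0xE000, 0xF8FF),       # BMP Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B (plane 16)
)


def is_pua(ch):
    """Return True if the single character ch is a private use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_pua(text):
    """Return a list of (index, 'U+XXXX') pairs for every PUA character in text."""
    return [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(text) if is_pua(ch)]


if __name__ == "__main__":
    sample = "abc\ue000d"
    hits = find_pua(sample)
    if hits:
        print("plain text contains untagged PUA code points:", hits)
```

A sender could refuse to emit the text, or attach the agreement out of band, whenever `find_pua` returns a non-empty list.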

    Unfortunately, there are products that make use of PUAs without explicit tagging:
    * The best-known products doing that are fonts. To find their actual meaning, you have to look into private structures defined by font vendors or foundries (fortunately, the font vendor or foundry is often identified in the font itself, especially in TrueType/OpenType formats, so the various OSes can adapt themselves).
    * The other kind of products using PUAs without explicit agreement are the OSes themselves, in their shaping engines.
    But unfortunately, there is still no filter within browsers to block untagged PUAs from getting the meaning implied by the font or by the shaping engine without an explicit agreement (for example a JavaScript instruction in an HTML page). So PUAs are "leaking" into places where they should not be present, and users tend to think they are part of some supported standard, and complain when this produces conflicts later.

    For me, a pure HTML document that contains PUAs should be declared invalid, or should have all its PUAs remapped to non-characters rendered with the same default "square box", whatever their source, unless the HTML document explicitly instructs the browser to use a specific PUA convention.
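To illustrate the remapping I have in mind, here is a minimal sketch, not a description of what any browser does. It assumes the document text is already decoded, uses a hypothetical `convention_uri` parameter to stand for "the document declared a PUA convention", and substitutes U+FFFD REPLACEMENT CHARACTER as a stand-in for the uniform "square box" (U+FFFD is not a noncharacter; it is simply a convenient placeholder for this example).

```python
import unicodedata

# Placeholder shown for every untagged PUA character (an assumption of this sketch).
PLACEHOLDER = "\ufffd"


def neutralize_pua(text, convention_uri=None):
    """Replace private use characters unless a convention was explicitly declared.

    General category 'Co' is the Unicode category for private use code points.
    """
    if convention_uri is not None:
        # An explicit agreement accompanies the text: leave PUAs untouched.
        return text
    return "".join(
        PLACEHOLDER if unicodedata.category(ch) == "Co" else ch
        for ch in text
    )


if __name__ == "__main__":
    print(neutralize_pua("logo: \ue000"))
    print(neutralize_pua("logo: \ue000", "urn:uuid:00000000-0000-0000-0000-000000000000"))
```

The point of the design is that the font or shaping engine never gets a chance to assign its own meaning to an untagged PUA code point.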

    I consider this "PUA leakage bug" a serious security issue, much more serious than an interoperability problem (which it is not). To close this bug, a protocol should be created to allow transmitting those private convention agreements, in addition to the character encoding information. This could take the form of a standard metadata attribute whose value is a URI in a non-local namespace (for example a URN such as a UUID, or a URL containing the Internet domain name of the convention creator). The exact value of this URI would not matter: it would be used as an opaque string, like namespace specifiers in XML, adhering to standard URI conventions and starting with a registered URI-scheme prefix...
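A receiver could treat such an identifier roughly as follows. This is only a sketch under my own assumptions: no such attribute exists today, the registry `KNOWN_CONVENTIONS` and its example entries are invented for illustration, and the accepted schemes (`urn`, `http`, `https`) are just one possible reading of "non-local namespace".

```python
from urllib.parse import urlsplit

# Hypothetical local registry: opaque convention URI -> private meaning of each PUA.
KNOWN_CONVENTIONS = {
    "urn:uuid:00000000-0000-0000-0000-000000000000": {"\ue000": "example glyph A"},
    "https://example.org/pua/v1": {"\ue000": "example glyph B"},
}


def is_convention_uri(uri):
    """Accept only URIs in a non-local namespace (assumed: urn, http, https)."""
    parts = urlsplit(uri)
    if parts.scheme == "urn":
        return bool(parts.path)
    if parts.scheme in ("http", "https"):
        return bool(parts.hostname)
    return False


def lookup_convention(uri):
    """Return the PUA mapping for uri, or None if the convention is unknown.

    The URI is never dereferenced; it is compared as an opaque string,
    in the same way XML namespace names are.
    """
    if not is_convention_uri(uri):
        raise ValueError("not an acceptable convention identifier: %r" % (uri,))
    return KNOWN_CONVENTIONS.get(uri)


if __name__ == "__main__":
    print(lookup_convention("https://example.org/pua/v1"))
```

An unknown but well-formed identifier returns `None`, in which case the receiver would fall back to treating the PUAs as untagged.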

    This archive was generated by hypermail 2.1.5 : Mon Jan 22 2007 - 03:16:05 CST