Re: Defined Private Use was: SSP default ignorable characters

From: Philippe Verdy (
Date: Wed Apr 28 2004 - 15:29:23 EDT


    From: "Peter Kirk" <>
    > Software developers, or applications, are not supposed to be party to
    > the agreement between *users*.

    Are you saying that software developers fail to comply with Unicode rules
    when they refuse to develop systems that allow *users* to make such private
    agreements and to use the PUAs effectively, as users are legitimately
    entitled to ask of their software developers?

    Interesting point. This would be an argument for the development (outside of
    Unicode) of some standard technical solutions for exchanging these private
    conventions on PUA usage, including the exchange of character properties, etc.

    Why not then within fonts -- namely in OpenType tables for fonts built with
    these PUA assignments?

    If so, a fully Unicode-compliant system should offer ways to interchange data
    between the parties to these private agreements, and ensure that the PUA
    encoding conventions are isolated and kept within the domain of the private
    agreement. For example, documents could be labelled with tags containing a
    URI, either out of band in rich text formats such as XML or precomposed PDF
    files, or in band within the encoded text using special tags, in a way
    similar to language tags. Currently, however, Unicode has not defined such an
    area in plane 14 for any use other than language tags.

    I note however that language tags (even if they are discouraged by Unicode) are
    not deprecated, and that they could even be used, according to the RFC 3066
    encoding format or one of its extensions, to also carry additional attributes
    identifying private communities that share a common agreement.
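
    To make the in-band option concrete, here is a minimal sketch in Python of
    how an ASCII tag string maps onto plane 14 tag characters (U+E0001 LANGUAGE
    TAG followed by U+E0020..U+E007E, which mirror printable ASCII). The tag
    value "x-pua-demo" is made up for illustration, and using such a tag to
    identify a PUA agreement is only the proposal discussed here, not something
    Unicode defines.

```python
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000       # tag character = TAG_BASE + ASCII code


def encode_tag(tag: str) -> str:
    """Encode an ASCII tag as U+E0001 followed by plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in tag)


def decode_tag(text: str):
    """Split a leading tag off text and return (tag, remaining_text)."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        return "", text
    i = 1
    while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
        i += 1
    tag = "".join(chr(ord(c) - TAG_BASE) for c in text[1:i])
    return tag, text[i:]


# Hypothetical tag naming a private agreement, followed by PUA text.
tagged = encode_tag("x-pua-demo") + "\uE000abc"
tag, rest = decode_tag(tagged)
```

    A converter to a rich text format could call decode_tag() and move the
    recovered tag out of band, leaving only the PUA text in the content.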

    So with Unicode language tags containing a standard language code and
    attributes, such an extension would become possible if Unicode specified less
    ambiguously how to handle documents containing language tags (notably their
    scope of application within the encoded documents). When a plain-text document
    is later converted to some rich text format, the language tag could be
    extracted and put out of band within some XML schema to describe the semantics
    of the encoded plain-text fragments containing PUAs, within their restricted
    scope.

    So instead of identifying PUAs only by their codepoint (which is bound to a
    unique namespace), they would be identified within a namespace made of the
    private agreement URI and the codepoint (quite similar to the concept of
    namespaces in XML, where all entities are named within a well-defined scope).
    One way to cope with this would be then to reserve and bind all non-PUA and all
    invalid codepoints in all possible namespaces, to the namespace.
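
    A small Python sketch of this namespacing idea follows. The names
    QualifiedCodePoint and qualify, and the example.org URIs, are hypothetical.
    Representing the single shared namespace for non-PUA codepoints as None is
    my own reading of the binding rule above, not an established convention.

```python
from typing import NamedTuple, Optional


def is_private_use(cp: int) -> bool:
    """True for the BMP PUA and the two supplementary private use planes."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


class QualifiedCodePoint(NamedTuple):
    agreement: Optional[str]  # agreement URI, or None for the shared namespace
    codepoint: int


def qualify(cp: int, agreement_uri: str) -> QualifiedCodePoint:
    """PUA codepoints take the agreement's namespace; all others are shared."""
    if is_private_use(cp):
        return QualifiedCodePoint(agreement_uri, cp)
    return QualifiedCodePoint(None, cp)


# Hypothetical agreement URIs, for illustration only.
A = "http://example.org/pua/agreement-a"
B = "http://example.org/pua/agreement-b"
```

    With this, qualify(0xE000, A) and qualify(0xE000, B) are distinct keys,
    while qualify(0x41, A) and qualify(0x41, B) compare equal.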

    There's a way to make those PUAs easily manageable by users:
    - let each user have a registry of PUA agreements (identified in interchanges by
    their URI). If the user accepts an agreement, it is recorded in that user's
    registry.
    - the registry will map each described Unicode PUA codepoint to a non-Unicode
    codepoint (for example in the larger 31-bit space which was originally defined
    for ISO 10646). These internal mappings will allow local-only management of
    these encoded strings. For all interchanges, all non-Unicode codepoints (outside
    the first 17 planes) will be looked up in the user database, which will remap
    each 31-bit codepoint into the URI + the 21-bit Unicode PUA codepoint, so that a
    plain-text document can be regenerated using language-tag tagging, XML
    attributes, or some other rich text format...
    - for local document handling, UTF-8 (the original version!) or UTF-32 could be
    used to easily manage all private character properties, without colliding with
    PUAs used in other private agreements or with other standard Unicode codepoints.
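
    The registry steps above could be sketched in Python roughly as follows.
    The class PuaRegistry, its method names, and the layout (one 2^21-wide block
    of the 31-bit space per accepted agreement) are all assumptions made for
    illustration; any injective mapping above U+10FFFF would do. Internal text
    is held as a list of integers because Python strings cannot hold values
    above 0x10FFFF.

```python
UNICODE_MAX = 0x10FFFF
BLOCK_BITS = 21                       # a Unicode codepoint fits in 21 bits
MAX_BLOCKS = (1 << (31 - BLOCK_BITS)) - 1   # blocks 1..1023 fit in 31 bits


def is_private_use(cp: int) -> bool:
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


class PuaRegistry:
    """Per-user registry of accepted PUA agreements (hypothetical design)."""

    def __init__(self):
        self._uris = []               # index i holds the URI for block i + 1

    def accept(self, uri: str) -> None:
        if uri not in self._uris:
            if len(self._uris) >= MAX_BLOCKS:
                raise OverflowError("no free block left in the 31-bit space")
            self._uris.append(uri)

    def to_internal(self, uri: str, text: str) -> list:
        """Remap PUA codepoints of `text` into the block assigned to `uri`."""
        block = self._uris.index(uri) + 1    # ValueError if not accepted
        return [(block << BLOCK_BITS) | ord(c) if is_private_use(ord(c))
                else ord(c) for c in text]

    def to_interchange(self, codes: list) -> list:
        """Return (uri_or_None, character) pairs suitable for retagging."""
        out = []
        for code in codes:
            if code <= UNICODE_MAX:
                out.append((None, chr(code)))
                continue
            block, cp = code >> BLOCK_BITS, code & ((1 << BLOCK_BITS) - 1)
            if not 1 <= block <= len(self._uris):
                raise ValueError(f"unknown internal codepoint {code:#x}")
            out.append((self._uris[block - 1], chr(cp)))
        return out


registry = PuaRegistry()
registry.accept("http://example.org/pua/a")   # hypothetical URIs
registry.accept("http://example.org/pua/b")
internal_a = registry.to_internal("http://example.org/pua/a", "x\uE000")
internal_b = registry.to_internal("http://example.org/pua/b", "x\uE000")
```

    Here the same U+E000 gets two different internal values, one per agreement,
    while "x" is left untouched, and to_interchange() recovers the URI and the
    original PUA codepoint for each.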

    Such a solution would have the additional effect of greatly reducing the
    number of PUAs needed in Unicode, and each party could use them the way it wants
    with its own sets of character properties (including by overriding the default
    combining classes and canonical decompositions!). There is no need to split the
    PUA space, which, with more than 135,000 codepoints, is really large enough to
    encode any single private agreement.
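
    For reference, the size of the PUA space can be checked with a few lines of
    Python, using the three private use ranges of the standard (the variable
    names are mine):

```python
# The BMP Private Use Area plus the two supplementary private use planes.
PUA_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]

sizes = [hi - lo + 1 for lo, hi in PUA_RANGES]
total = sum(sizes)
print(sizes, total)   # [6400, 65534, 65534] 137468
```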

    The difficulty will be in how to describe this agreement with a URI:
    what should that URI provide? If it's a URL, it could be that of an XML
    document describing the set of conventions, the property tables, and the sets of
    suggested or required fonts... The problem is then to create and maintain a
    schema that allows describing these conventions. Such a schema should be able
    to contain at least all the properties that are already described in Unicode,
    plus some other private data or tables.
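
    Purely as an illustration of what such a description might look like, here
    is a Python sketch that reads a made-up XML document. No such schema
    exists: the element and attribute names (agreement, font, char, cp, gc,
    ccc), the font name, and the example.org URI are all invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreement description; nothing here is a real schema.
AGREEMENT_XML = """\
<agreement uri="http://example.org/pua/demo">
  <font name="DemoPUA" required="false"/>
  <char cp="E000" name="DEMO LETTER A" gc="Lo" ccc="0"/>
  <char cp="E001" name="DEMO COMBINING MARK" gc="Mn" ccc="230"/>
</agreement>
"""


def load_agreement(xml_text: str):
    """Return (uri, font names, {codepoint: properties}) from the description."""
    root = ET.fromstring(xml_text)
    fonts = [font.get("name") for font in root.findall("font")]
    props = {}
    for char in root.findall("char"):
        props[int(char.get("cp"), 16)] = {
            "name": char.get("name"),
            "gc": char.get("gc"),          # general category
            "ccc": int(char.get("ccc")),   # canonical combining class
        }
    return root.get("uri"), fonts, props


uri, fonts, props = load_agreement(AGREEMENT_XML)
```

    A real schema would need to cover many more properties (decompositions,
    bidirectional class, case mappings, and so on) than this toy example.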

    The next complexity will arise when one wants to extend an agreement to allow
    migrating data from one private convention to another. This looks exactly
    like describing a transliteration scheme working within the larger local-only
    31-bit space... And it can be as complex as other stateful transliteration
    schemes, or as simple as mapping legacy 8-bit sets to Unicode (using
    simple stateless mappings).
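
    The simple stateless case can be sketched in Python with a plain lookup
    table. The table contents and the name migrate are invented for
    illustration, and the sketch works directly on 21-bit PUA codepoints rather
    than on the local 31-bit space; a stateful scheme would need more than a
    table.

```python
# Hypothetical table: where convention A's characters live in convention B.
A_TO_B = {
    0xE000: 0xE100,
    0xE001: 0xE101,
}


def migrate(text: str, table: dict) -> str:
    """Remap each codepoint found in the table; leave all others unchanged."""
    return text.translate(table)


migrated = migrate("a\uE000\uE001b", A_TO_B)
```

    Characters not covered by the table, such as "a" and "b" here, pass through
    unchanged.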
