RE: Defined Private Use was: SSP default ignorable characters

From: Ernest Cline
Date: Wed Apr 28 2004 - 15:00:51 EDT


    > > This is most easily and most naturally controlled by the end users
    > > of such introspective setups - simply do not allow conflicting PUA
    > > code points on their systems. In such a scenario, the operating
    > > system is not forced to make decisions.
    > That seems unduly limiting. If I want to write a document in two
    > scripts, each of which is supported by only one font, both of which use
    > the same code point range for their characters, I'm stuck.

    True, but not all that common a concern. Formatted documents
    enable one to specify which font is intended, and the SSP tags
    offer a possible solution for plain text. The case of multiple
    Private Use scripts coming into conflict in the same document
    is rare enough that even I, a proponent of a better set of
    Private Use characters, am comfortable with having to depend
    upon formatting markup to make the distinction. Something
    that could do so in plain text would be nice, but is not necessary.
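    As a concrete illustration of the Plane 14 tag mechanism referred to
    above, here is a minimal Python sketch. It shows the one tag use that
    is actually encoded (language tags); identifying a Private Use script
    this way would be a new, hypothetical application of the same plane.

```python
# Sketch: constructing a Plane 14 tag sequence. U+E0001 LANGUAGE TAG
# introduces the tag, and the tag text is spelled with the tag
# characters U+E0020..U+E007E, each of which is the corresponding
# ASCII character shifted up by 0xE0000.
def language_tag(tag: str) -> str:
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in tag)

seq = language_tag("en")
print([f"U+{ord(c):05X}" for c in seq])  # ['U+E0001', 'U+E0065', 'U+E006E']
```

    A hypothetical PUA-identification tag would work the same way, which
    is exactly why it raises the ISO 2022 comparison made below.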

    > There's been a lot of discussion of the PUA in this forum over the time
    > I've been on it, but I don't think I've heard anyone make the following
    > point:
    > If you're using the PUA outside a closed system, you're not using
    > Unicode.
    > The PUA is intended for the internal use of applications (or groups of
    > applications), or for interchange between applications by private
    > agreement of all parties involved. Writing a document in Microsoft Word
    > using some exotic script that doesn't have plain-vanilla behavior
    > violates this because Microsoft Word isn't a party to the private
    > agreement. You either have to write software yourself that does the
    > right thing with your characters (you don't have to rewrite Windows, but
    > you might have to rewrite Word, which I agree isn't really any more
    > realistic).
    > Therefore, if you're using the PUA out in the "wild" and expecting free
    > interchange, you're not using Unicode anymore; you're using a separate
    > encoding _based_ on Unicode. In many respects, it's identical to
    > Unicode, but it's a separate encoding because it applies additional
    > semantics to code points whose definition Unicode leaves open. It seems
    > to me that if you want to ensure that documents that make use of the PUA
    > are interpreted properly by, say, someone who downloads them from the
    > Web, you have to tag their encoding as something other than Unicode, and
    > if you want OS vendors to support particular semantics for PUA code
    > points, you have to ask them to support this other encoding that gives
    > those code points those semantics.
    > Of course, if you're going to try to standardize a use of the PUA, it
    > seems to make just as much sense to standardize the actual characters in
    > Unicode in the normal way. If we have a bunch of different
    > Unicode-derived encodings out there, that basically resurrects the
    > problem Unicode was designed to solve. But I'm beginning to think this
    > is already happening in some places.
    > Using Plane 14 tag characters to identify particular uses of the PUA
    > seems very akin to the old ISO 2022 code-switching scheme, and I
    > _really_ don't think we want to go there again.
    > In any event, imposing semantics on PUA code points in documents out in
    > the "wild" isn't a "private use," and therefore documents and
    > applications doing this are using an ad-hoc Unicode-derived encoding,
    > not Unicode. It should be dealt with as such, rather than trying to
    > turn Unicode into ISO 2022.

    There will always be scripts that Unicode will not support, whether
    because they are constructed scripts with no real use, because they are
    rare or ancient scripts that lack sufficient examples to determine how
    the script should be encoded, or because they are picture fonts. This
    last we can discount, because the existing Private Use Area supports
    them adequately. (The distinction between the behavior of categories
    Co and So is so minimal as to be not worth worrying about.) However,
    for real scripts that Unicode has either not yet encoded or never will,
    there are currently two options for those who seek to implement them.

    1) Live with the limitations of the PUA and accept that your Private Use
    Script will never be able to do the things that other scripts take for
    granted.

    2) Mimic your script on the basis of characters already encoded in an
    existing character encoding. Traditionally that has been done by
    creating a font that claimed to be a particular legacy encoding, since
    fonts were the part of the operating system that offered the greatest
    ease of user customization in a form that was relatively platform
    independent. One reason for using this method was to also benefit from
    the keyboard layout already in use, but of late it has become easier to
    distribute input methods, and there exists a strong current towards
    being able to do so interoperably across platforms. Once that barrier
    crumbles, we will see more private use scripts that, instead of
    hijacking legacy encodings, will hijack Unicode unless a mechanism
    exists for them to establish their own properties. This mechanism could
    be established in one of two ways:

    A) The various OS's could provide an easy way to override the
    default Private Use characteristics. While the UTC would no doubt
    prefer that this solution be adopted, it is extremely unlikely to work.
    First of all, some of the properties that Unicode defines, such as
    line breaking, are implemented by applications, with varying
    levels of OS support. So not only would OS's have to adopt a
    mechanism for users to describe how they want the PUA used;
    applications would have to start consulting it, and there would
    need to be a way to negotiate the characteristics to be used if the
    system knows of multiple ways that a particular PUA codepoint
    is defined by various private agreements.
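    The negotiation problem in option (A) can be made concrete with a
    small Python sketch. The registry class and the two agreement names
    are hypothetical, invented purely to show why the system cannot
    silently pick a winner when two private agreements disagree:

```python
# Hypothetical sketch: two private agreements register different
# properties for the same PUA code point. The registry detects the
# conflict rather than choosing arbitrarily on the user's behalf.
class PuaPropertyRegistry:
    def __init__(self):
        self._props = {}  # codepoint -> (agreement name, properties)

    def register(self, agreement, codepoint, properties):
        prior = self._props.get(codepoint)
        if prior is not None and prior[1] != properties:
            # Conflict: the OS has no principled way to choose.
            raise ValueError(
                f"U+{codepoint:04X} already defined by {prior[0]!r} "
                f"with different properties")
        self._props[codepoint] = (agreement, properties)

    def lookup(self, codepoint, default=None):
        entry = self._props.get(codepoint)
        return entry[1] if entry else default

reg = PuaPropertyRegistry()
reg.register("script agreement 1", 0xE000, {"gc": "Lo", "lb": "AL"})
try:
    reg.register("script agreement 2", 0xE000, {"gc": "Lo", "lb": "ID"})
except ValueError as e:
    print("conflict:", e)
```

    Every application doing line breaking, casing, and so on would have
    to consult such a registry, which is why option (A) seems unlikely
    to work in practice.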

    B) Unicode could provide a set of private use characters, such as:

    E1000;PRIVATE LTR CAPITAL LETTER-1;Lu;0;L;;;;;N;;;;E1001;
    E1001;PRIVATE LTR SMALL LETTER-1;Ll;0;L;;;;;N;;;E1000;;

    # LineBreak.txt example (alphabetic letters take class AL):
    E1000;AL # PRIVATE LTR CAPITAL LETTER-1
    E1001;AL # PRIVATE LTR SMALL LETTER-1
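    To show how an application could consume such records, here is a
    minimal Python sketch. The field layout is the standard
    UnicodeData.txt format (field 13 is the simple uppercase mapping,
    field 14 the simple lowercase mapping); the E1000/E1001 entries are,
    as noted below, from an unfinished proposal, not the real standard:

```python
# Sketch: parsing UnicodeData.txt-style records for the proposed
# private-use cased letters, then building simple case-mapping tables.
RECORDS = """\
E1000;PRIVATE LTR CAPITAL LETTER-1;Lu;0;L;;;;;N;;;;E1001;
E1001;PRIVATE LTR SMALL LETTER-1;Ll;0;L;;;;;N;;;E1000;;
"""

def parse_case_maps(data):
    to_lower, to_upper = {}, {}
    for line in data.splitlines():
        fields = line.split(";")
        cp = chr(int(fields[0], 16))
        if fields[12]:  # simple uppercase mapping (field 13)
            to_upper[cp] = chr(int(fields[12], 16))
        if fields[13]:  # simple lowercase mapping (field 14)
            to_lower[cp] = chr(int(fields[13], 16))
    return to_lower, to_upper

to_lower, to_upper = parse_case_maps(RECORDS)
print(f"U+{ord(to_lower[chr(0xE1000)]):05X}")  # prints U+E1001
```

    This is exactly the kind of lookup that already-written software
    performs for standard characters; the point of option (B) is that it
    would work for private use letters with no per-agreement negotiation.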

    Now whether U+E1000 gets used for a Verdurian capital letter
    or some other character in private use would remain something
    that Unicode would not, and should not, care about in the least.
    End users would still be hijacking codepoints for their non-standard
    uses, but at least they would no longer need to hijack codepoints
    that have standard interpretations, such as LATIN CAPITAL LETTER U.
    Which would you rather have VERDURIAN CAPITAL LETTER U
    mapped to in order to get support for its casing properties:
    U+E1000 (PRIVATE LTR CAPITAL LETTER-1), or U+0055
    (LATIN CAPITAL LETTER U)?

    Note: Verdurian is one of the scripts in the ConScript Unicode Registry.
    That registry does not hijack characters to get the desired properties
    supported, but it has been used here as an example.
    Note: U+E1000 and U+E1001 are currently unassigned codepoints.
    They are, however, assigned the properties given above in the
    Private Use proposal I have been working on. That proposal has
    not reached even rough draft status, but it looks like it will be
    contained in the region U+E0F00 to U+E3FFF, excluding support
    for Ideographic characters. Ideographic support adds considerably
    to the size of the proposal, but such characters can be reasonably
    well supported by the large existing PUA blocks, so it is not a priority.

    This archive was generated by hypermail 2.1.5 : Wed Apr 28 2004 - 16:00:08 EDT