RE: Defined Private Use was: SSP default ignorable characters

From: Language Analysis Systems, Inc. Unicode list reader
Date: Wed Apr 28 2004 - 11:14:12 EDT

    >This is most easily and most naturally controlled by the end users of
    >such introspective setups - simply do not allow conflicting PUA code
    >points on their systems. In such a scenario, the operating system is
    >not forced to make decisions.

    That seems unduly limiting. If I want to write a document in two
    scripts, each of which is supported by only one font, both of which use
    the same code point range for their characters, I'm stuck.

    >But if there were multiple PUA fonts with competing code points on a
    >single system, I suggest that OS'es should simply go with the first
    >font (defining "first" as that font whose file name has the lowest
    >Unicode-code-point-based "alphabetic" order).
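
    For concreteness, the quoted "first font" rule might look roughly like
    the sketch below. This is an illustration only, not anything specified
    in this thread: the function name pick_first_font and the sample file
    names are hypothetical, and it assumes "lowest Unicode-code-point-based
    order" means a plain code point by code point comparison of file names,
    which is how Python compares str values.

```python
from typing import Iterable, Optional


def pick_first_font(file_names: Iterable[str]) -> Optional[str]:
    """Return the file name that sorts lowest by Unicode code point.

    Hypothetical helper illustrating the quoted tie-breaking rule.
    Python compares strings code point by code point, so min() gives
    the "first" name in that order. Returns None if no fonts are given.
    """
    names = list(file_names)
    if not names:
        return None
    return min(names)


if __name__ == "__main__":
    # Hypothetical competing PUA fonts installed on one system.
    fonts = ["zeta.ttf", "alpha.ttf", "Alpha.ttf"]
    print(pick_first_font(fonts))
```

    Note that under this reading the comparison is case-sensitive
    (uppercase Latin letters sort before lowercase ones), so the outcome
    depends on details of file naming rather than on anything the document
    author intended.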

    If the principle behind the PUA is that anyone can do anything with
    these code points, standardizing anything about them erodes that
    principle and potentially breaks existing software making use of the

    There's been a lot of discussion of the PUA in this forum over the time
    I've been on it, but I don't think I've heard anyone make the following

    If you're using the PUA outside a closed system, you're not using

    The PUA is intended for the internal use of applications (or groups of
    applications), or for interchange between applications by private
    agreement of all parties involved. Writing a document in Microsoft Word
    using some exotic script that doesn't have plain-vanilla behavior
    violates this because Microsoft Word isn't a party to the private
    agreement. You either have to write software yourself that does the
    right thing with your characters (you don't have to rewrite Windows, but
    you might have to rewrite Word, which I agree isn't really any more

    Therefore, if you're using the PUA out in the "wild" and expecting free
    interchange, you're not using Unicode anymore; you're using a separate
    encoding _based_ on Unicode. In many respects, it's identical to
    Unicode, but it's a separate encoding because it applies additional
    semantics to code points whose definition Unicode leaves open. It seems
    to me that if you want to ensure that documents that make use of the PUA
    are interpreted properly by, say, someone who downloads them from the
    Web, you have to tag their encoding as something other than Unicode, and
    if you want OS vendors to support particular semantics for PUA code
    points, you have to ask them to support this other encoding that gives
    those code points those semantics.
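
    A minimal sketch of what such tagging could involve, under stated
    assumptions: scan the text for code points in the three Private Use
    ranges (U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD) and, if
    any are present, label the document with something other than a plain
    Unicode charset name. The label "x-example-pua-agreement" and the
    function names are made up for illustration; no such registered
    encoding name is implied.

```python
# The three Private Use ranges defined by the Unicode Standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (Plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (Plane 16)
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def charset_label(text: str,
                  private_label: str = "x-example-pua-agreement") -> str:
    """Pick a label for text: the hypothetical private label if any
    private-use code point occurs, otherwise plain "utf-8"."""
    if any(is_private_use(ch) for ch in text):
        return private_label
    return "utf-8"


if __name__ == "__main__":
    print(charset_label("plain text"))
    print(charset_label("text with \ue000 in it"))
```

    The check only detects that private-use code points are present. It
    cannot say which private agreement they belong to, which is exactly the
    information a separate encoding label would have to carry.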

    Of course, if you're going to try to standardize a use of the PUA, it
    seems to make just as much sense to standardize the actual characters in
    Unicode in the normal way. If we have a bunch of different
    Unicode-derived encodings out there, that basically resurrects the
    problem Unicode was designed to solve. But I'm beginning to think this
    is already happening in some places.

    Using Plane 14 tag characters to identify particular uses of the PUA
    seems very akin to the old ISO 2022 code-switching scheme, and I
    _really_ don't think we want to go there again.

    In any event, imposing semantics on PUA code points in documents out in
    the "wild" isn't a "private use," and therefore documents and
    applications doing this are using an ad-hoc Unicode-derived encoding,
    not Unicode. It should be dealt with as such, rather than trying to
    turn Unicode into ISO 2022.

    --Rich Gillam
      Language Analysis Systems, Inc.
