An attempt to focus the PUA discussion [long]

From: Language Analysis Systems, Inc. Unicode list reader (
Date: Thu Apr 29 2004 - 14:27:47 EDT

    I should probably just let this go, but I'm going to weigh in on the PUA
    issue one more time. Mailing-list and Usenet discussions tend to be
    unfocused by their very nature, so any attempt by me to focus this
    discussion is probably doomed to failure, but I'm going to try anyway.
    If nothing else, maybe clarifying my thoughts on this will provide text
    I can incorporate into the next edition of my book, if it goes to
    another edition.

    First, let's try to clear up some misunderstandings about the PUA:
    1) The Private Use Area is a set of code points in the Unicode code
    space that are considered assigned, but whose semantics are purposely
    left open. The Unicode standard takes no position as to the meanings or
    uses of these code points. The basic idea was to set aside a range of
    code points specifically for internal application use or exchange
    between cooperating applications. It's true that an application (or
    group of cooperating applications) can, internally, alter Unicode's
    representation of characters however it wishes, but setting aside a
    range of characters specifically for internal use helps keep people from
    writing applications that accidentally produce non-interoperable output
    or that can never support some script because they've cannibalized that
    script's code points for their own internal use.
    2) This usage is similar to the noncharacter code points, which are also
    set aside for internal application use. The idea is generally that PUA
    code points are to be used for characters, and noncharacters are to be
    used for other things, such as sentinel values in APIs (the classic
    example is using U+FFFF to represent EOF in a getChar()-like API).
    3) PUA characters are unsuitable for general interchange because there's
    no way of guaranteeing that the receiver of a document will apply the
    same semantics to a run of PUA code points as the sender intended.
    Guaranteeing this requires some sort of external agreement between
    sender and receiver, at which point you're no longer talking about
    Unicode plain text. You're either talking about fancy text of some
    kind, or you're talking about plain text using some kind of
    Unicode-derived encoding.
    4) For applications that don't want to apply any particular semantics to
    the PUA code points, the Unicode standard sets forth a set of default
    properties they should apply to those code points if they encounter
    them. For a certain relatively narrowly-defined range of uses, this
    delegates the job of applying semantics to the code points to something
    else, usually a font and a fancy-text mechanism that allows one to
    choose a font. Even here, though, the application is applying some
    semantics to those code points, and this only works when the intended
    use of the code points is compatible with those semantics. It's
    important to note that none of this really involves Unicode-- there's no
    requirement to apply the default properties to the PUA code points;
    Unicode simply recommends that applications that don't want to do
    anything else do this. What else are they going to do? Signal an
    error? This is the path of least resistance, and it's the choice that
    maximizes the possibility that users will be able to use the PUA without
    the application having to know or care.
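
    To make points 1, 2, and 4 a little more concrete, here is a minimal
    Python sketch. It assumes Python's standard unicodedata module as the
    source of default properties; the helper names (is_private_use,
    is_noncharacter, default_properties, get_char) are made up for this
    illustration and don't come from any standard API.

```python
import unicodedata

# Private Use ranges: the BMP block plus Planes 15 and 16.
PUA_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]

EOF = 0xFFFF  # a noncharacter, used here purely as an internal sentinel


def is_private_use(cp: int) -> bool:
    """True if cp falls in one of the Private Use ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def default_properties(cp: int) -> dict:
    """Look up the properties an application gets if it assigns no
    semantics of its own to the code point."""
    ch = chr(cp)
    return {
        "category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
    }


def get_char(text: str, pos: int) -> int:
    """A getChar()-like API: returns a code point, or the EOF sentinel
    once pos runs past the end of the text."""
    return ord(text[pos]) if pos < len(text) else EOF


if __name__ == "__main__":
    print(is_private_use(0xE000), is_noncharacter(EOF))
    print(default_properties(0xE000))
    print(hex(get_char("a", 1)))
```

    The sentinel works only because U+FFFF is guaranteed never to be a
    character; a PUA code point would be a poor choice for EOF, since
    somebody's text might legitimately contain it.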
    So what kind of things might people want to do with PUA code points?
    The possibilities are endless, of course, but I think the most-important
    cases are probably these:
    1) Logos, frequently-occurring pictorial elements, and dingbats,
    bullets, and other decorative flourishes. You might have a GE corporate
    font, for example, that has a couple different versions of the
    "meatball" applied to PUA code points. In all cases, you're talking
    about things which generally aren't "characters" and probably will never
    be encoded in Unicode, but which certain user communities still need to
    commingle with text and interact with as though they were text.
    2) Newly-coined technical symbols. A particular researcher invents a
    new mathematical operator, for example, and uses it in a paper or two.
    Again, unless other people start using the new operator too, it isn't
    suitable for encoding in Unicode, but the researcher still needs a way
    to put it into his paper, and it's easier to be able to work with it as
    text than to paste it in as a picture (provided, again, that he has a
    way to make a font with that operator in it).
    3) Scripts or characters that aren't in Unicode yet, but should and
    probably will be. Here you need a way to deal with those characters
    before they get into Unicode, and perhaps a way of prototyping
    implementations as an input to the standards process. Still, this is an
    interim use that is intended to be superseded by real, encoded code
    points at some point in the future.
    4) Scripts or characters that have been rejected by the UTC, but that a
    significant community still wants to use.

    Uses 1 and 2 are generally covered just fine by the default properties.
    Most instances of uses 3 and 4 are also covered just fine by the default
    properties. The discussion seems to be getting rather heated over how
    to accommodate instances of use 3 that aren't covered by the default
    properties. I think the discussion has been heated because advocates of
    uses that don't fit the default properties feel that setting the default
    properties a particular way (or, indeed, having default properties in
    the first place) unfairly favors other user communities.

    Seems like it might have made more political sense if Unicode really had
    left the semantics of PUA code points completely open and not assigned
    default properties. Or if it had assigned default properties that made
    the PUA universally unusable, such as making them all default ignorable.
    Then an application would have to explicitly know and care about some
    set of PUA code points in order for them to be usable at all. Either
    that, or most application vendors would have done essentially what
    they're doing now, and the defaults we have would have been the de facto
    defaults anyway.

    Seems to me that the choice of defaults was designed to irritate the
    smallest number of people possible and cover the widest range of use
    cases possible, and that we're now hearing from people in that "smallest
    possible" group.

    Those people have legitimate needs. How should they be accommodated,
    and how does Unicode participate in that process? Seems there are a
    number of options:
    1) Change the default properties for some range of the PUA. This is
    what people seem to be pushing hardest for. There are a number of
    problems with this approach:
    a) How do you subdivide the range? There are a lot of Unicode
    properties; are we to try to set aside a separate range for each
    unique combination? If we don't, how do we avoid STILL leaving
    somebody out in the cold?
    b) What about existing uses? If there are a lot of user communities
    out there taking advantage of the default behavior, and they're not
    all using the same part of the PUA, changing the defaults will break
    their existing uses. This is okay as far as the Unicode standard is
    concerned, but it won't be okay with real users and real application
    vendors. Seems like even if Unicode changes the default, you'd have a
    lot of application vendors who decide to leave their PUA properties
    alone rather than break existing uses. Even if this weren't a
    problem, there'd still be a lag while implementations were changed.
    2) Leave the current PUA alone, but set aside a new PUA, say Planes 12
    and 13. This solves the existing-use problem, but you still have the
    question of just how you subdivide the range, and it starts to cut down
    significantly on the code points available for actual standardization.
    3) Define ad-hoc standards that are based on Unicode but make character
    assignments in the PUA and lobby application vendors to support these
    encodings in addition to regular Unicode. Here, you'd still have an
    implementation lag, and you're now defining some sort of parallel
    standardization process. Better to just let the wheels of Unicode grind
    their way toward real standardization and concentrate on real interim
    solutions. This does seem like an option for things in case 4, although
    I'm still not sure it's a good idea.
    4) Lobby for operating-system vendors to extend their text engines to
    allow properties of PUA code points to be configured. With some text
    engines, a lot of the work is delegated to the font, and you can get
    most of the effects you need by designing appropriate fonts. The big
    case where this doesn't seem to be possible is OpenType. Maybe the user
    communities who aren't being accommodated now should be lobbying for
    things like pluggable OpenType renderers. This seems like a good idea,
    but it's completely outside the scope of Unicode. This also doesn't
    solve all the problems. Searching and sorting can be covered with
    tailored collation weights on systems that let you specify collation
    weights directly, but direct character property queries are tougher--
    this stuff generally isn't configurable and making it configurable would
    slow lots of things WAY down. But if you're mostly concerned with
    rendering, these might be properties you can live without configuring.
    5) Write specialized applications that are designed to deal with certain
    scripts and address the needs of user communities whose needs aren't
    being met right now. Of course it's better to use industry-standard
    applications, but when they don't do the job, you write your own and
    make do. This seems like something that could be done by the
    open-source community.
    6) Use markup or other fancy-text mechanisms to override the default
    properties. There are plenty of controls for managing
    directionality, cursive joining, and line breaking. It may be
    inconvenient to use them, but it seems like a viable workaround while
    waiting for something to get into Unicode, and there's no implementation
    lag. What problems do the existing mechanisms not solve? Maybe the
    discussion should focus on this question-- are there mechanisms that
    should be added to Unicode or some markup language to help enable some
    of these scripts?
    7) Design custom fonts that cannibalize existing code points that have
    the right sets of properties. This is a terrible long-term solution,
    but seems acceptable for interim use while waiting for something to be
    standardized and implementations to catch up.
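
    As an illustration of option 6, here is a minimal Python sketch that
    wraps every run of PUA characters in explicit directional override
    controls (U+202E RIGHT-TO-LEFT OVERRIDE ... U+202C POP DIRECTIONAL
    FORMATTING). It assumes a hypothetical right-to-left script parked in
    the PUA; the function name force_rtl_on_pua_runs is invented for this
    example, and whether the result displays correctly still depends on
    the renderer honoring the controls and on a suitable font.

```python
import re

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Matches runs of private-use characters (BMP block plus Planes 15 and 16).
_PUA_RUN = re.compile(
    "[\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD]+"
)


def force_rtl_on_pua_runs(text: str) -> str:
    """Bracket each PUA run with RLO ... PDF so a bidi-aware renderer
    lays it out right-to-left despite the default left-to-right class."""
    return _PUA_RUN.sub(lambda m: RLO + m.group(0) + PDF, text)


if __name__ == "__main__":
    sample = "abc \uE000\uE001\uE002 def"
    # Show the result as escaped code points so the controls are visible.
    print(force_rtl_on_pua_runs(sample).encode("unicode_escape").decode())
```

    This is inconvenient in exactly the way described above (the controls
    have to travel with the text), but it needs no change to Unicode or to
    any application that already implements the bidi controls.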
    Options 1 and 2 are the ones Unicode itself could mostly carry out,
    but they're both problematic and I don't think they solve all the
    problems. 6 and 7 are
    the solutions that are available now without any OS or application
    vendors having to do a thing. They're ugly, but if we're talking mostly
    about stuff that's destined to find its way into Unicode someday, why
    won't they work as interim solutions, and why isn't the energy currently
    being put into solutions 1 and 2 better put into actually standardizing
    these characters? 4 seems desirable, but hard to pull off without a lot
    of market strength (although there might be enough of a market to make 5
    feasible). And 3 seems like a bad idea all around.
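
    For what option 4 might look like from the application side, here is
    a rough Python sketch of configurable PUA properties: a
    private-agreement table that overrides the defaults for specific code
    points and supplies tailored sort weights. Everything in it
    (PRIVATE_AGREEMENT, its entries, bidi_class, sort_key) is hypothetical;
    no existing text engine is being described.

```python
import unicodedata

# Hypothetical private agreement between cooperating applications:
# PUA code point -> overridden bidi class and a tailored sort weight.
# The weights deliberately differ from code point order.
PRIVATE_AGREEMENT = {
    0xE000: {"bidi_class": "R", "weight": 2},
    0xE001: {"bidi_class": "R", "weight": 1},
}


def bidi_class(ch: str) -> str:
    """Return the agreed bidi class if one is configured, otherwise
    fall back to the default from the Unicode character database."""
    override = PRIVATE_AGREEMENT.get(ord(ch))
    if override is not None:
        return override["bidi_class"]
    return unicodedata.bidirectional(ch)


def sort_key(text: str) -> list:
    """Build a sort key that uses tailored weights for agreed PUA code
    points and plain code point order for everything else. Placing
    agreed characters first is an arbitrary choice for this sketch."""
    key = []
    for ch in text:
        entry = PRIVATE_AGREEMENT.get(ord(ch))
        key.append((0, entry["weight"]) if entry else (1, ord(ch)))
    return key


if __name__ == "__main__":
    print(bidi_class("\uE000"), bidi_class("A"))
    words = ["\uE000", "\uE001"]
    print([hex(ord(w)) for w in sorted(words, key=sort_key)])
```

    Note that every property query now goes through an extra table lookup,
    which hints at why making this configurable inside a general-purpose
    text engine is not free.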
    I'd like to see the yelling stop and see people focus on the actual
    needs and on realistic solutions instead of just complaining about how
    the UTC isn't fair. And to the degree that the problem isn't really
    with Unicode at all (which is mostly), I'd like to see that discussion
    happen somewhere else.

    --Rich Gillam
      Language Analysis Systems, Inc.
