From: Language Analysis Systems, Inc. Unicode list reader (Unicode-mail@las-inc.com)
Date: Thu Apr 29 2004 - 14:27:47 EDT
I should probably just let this go, but I'm going to weigh in on the PUA
issue one more time. Mailing-list and Usenet discussions tend to be
unfocused by their very nature, so any attempt by me to focus this
discussion is probably doomed to failure, but I'm going to try anyway.
If nothing else, maybe clarifying my thoughts on this will provide text
I can incorporate into the next edition of my book, if it goes to
another edition.
First, let's try to clear up some misunderstandings about the PUA:
1) The Private Use Area is a set of code points in the Unicode code
space that are considered assigned, but whose semantics are purposely
left open. The Unicode standard takes no position as to the meanings or
uses of these code points. The basic idea was to set aside a range of
code points specifically for internal application use or exchange
between cooperating applications. It's true that an application (or
group of cooperating applications) can, internally, rework Unicode's
representation of characters however it wishes, but setting aside a
range of characters specifically for internal use helps keep people from
writing applications that accidentally produce non-interoperable output
or that can never support some script because they've cannibalized that
script's code points for their own internal use.
2) This usage is similar to that of the noncharacter code points, which are also
set aside for internal application use. The idea is generally that PUA
code points are to be used for characters, and noncharacters are to be
used for other things, such as sentinel values in APIs (the classic
example is using U+FFFF to represent EOF in a getChar()-like API).
3) PUA characters are unsuitable for general interchange because there's
no way of guaranteeing that the receiver of a document will apply the
same semantics to a run of PUA code points as the sender intended.
Guaranteeing this requires some sort of external agreement between
sender and receiver, at which point you're no longer talking about
Unicode plain text. You're either talking about fancy text of some
kind, or you're talking about plain text using some kind of
Unicode-derived encoding.
4) For applications that don't want to apply any particular semantics to
the PUA code points, the Unicode standard sets forth a set of default
properties they should apply to those code points if they encounter
them. For a certain relatively narrowly-defined range of uses, this
delegates the job of applying semantics to the code points to something
else, usually a font and a fancy-text mechanism that allows one to
choose a font. Even here, though, the application is applying some
semantics to those code points, and this only works when the intended
use of the code points is compatible with those semantics. It's
important to note that none of this really involves Unicode-- there's no
requirement to apply the default properties to the PUA code points;
Unicode simply recommends that applications that don't want to do
anything else do this. What else are they going to do? Signal an
error? This is the path of least resistance, and it's the choice that
maximizes the possibility that users will be able to use the PUA without
the application having to know or care.
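
As a rough illustration of points 1, 2, and 4 above, here is a minimal Python sketch that classifies a code point as private-use or noncharacter and looks up the default properties a generic implementation reports for it. The helper names (is_private_use, is_noncharacter, default_properties) are made up for this example; the property values come from whatever Unicode data ships with Python's standard unicodedata module.

```python
import unicodedata

# Private-use ranges (inclusive) in the Unicode code space.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (Plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (Plane 16)
)


def is_private_use(cp: int) -> bool:
    """True if cp falls in one of the private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def default_properties(cp: int) -> dict:
    """Properties a generic implementation reports for cp (hypothetical helper)."""
    ch = chr(cp)
    return {
        "category": unicodedata.category(ch),         # 'Co' for private use
        "bidi_class": unicodedata.bidirectional(ch),  # 'L' by default for PUA
        "combining": unicodedata.combining(ch),
    }


if __name__ == "__main__":
    for cp in (0xE000, 0xF0000, 0xFFFF):
        print(f"U+{cp:04X}", is_private_use(cp), is_noncharacter(cp),
              default_properties(cp))
```

An application that does not care about private-use semantics can simply take these defaults; one that does care has to layer its own table on top, since nothing in the standard data distinguishes one PUA code point from another.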
So what kind of things might people want to do with PUA code points?
The possibilities are endless, of course, but I think the most-important
cases are probably these:
1) Logos, frequently-occurring pictorial elements, and dingbats,
bullets, and other decorative flourishes. You might have a GE corporate
font, for example, that has a couple different versions of the
"meatball" applied to PUA code points. In all cases, you're talking
about things which generally aren't "characters" and probably will never
be encoded in Unicode, but which certain user communities still need to
commingle with text and interact with as though they were text.
2) Newly-coined technical symbols. A particular researcher invents a
new mathematical operator, for example, and uses it in a paper or two.
Again, unless other people start using the new operator too, it isn't
suitable for encoding in Unicode, but the researcher still needs a way
to put it into his paper, and it's easier to be able to work with it as
text than to paste it in as a picture (provided, again, that he has a
way to make a font with that operator in it).
3) Scripts or characters that aren't in Unicode yet, but should and
probably will be. Here you need a way to deal with those characters
before they get into Unicode, and perhaps a way of prototyping
implementations as an input to the standards process. Still, this is an
interim use that is intended to be superseded by real, encoded code
points at some point in the future.
4) Scripts or characters that have been rejected by the UTC, but that a
significant community still wants to use.
Uses 1 and 2 are generally covered just fine by the default properties.
Most instances of uses 3 and 4 are also covered just fine by the default
properties. The discussion seems to be getting rather heated over how
to accommodate instances of use 3 that aren't covered by the default
properties. I think the discussion has been heated because advocates of
uses that don't fit the default properties feel that setting the default
properties a particular way (or, indeed, having default properties in
the first place) unfairly favors other user communities.
Seems like it might have made more political sense if Unicode really had
left the semantics of PUA code points completely open and not assigned
default properties. Or if it had assigned default properties that made
the PUA universally unusable, such as making them all default ignorable.
Then an application would have to explicitly know and care about some
set of PUA code points in order for them to be usable at all. Either
that, or most application vendors would have done essentially what
they're doing now, and the defaults we have would have been the de facto
defaults anyway.
Seems to me that the choice of defaults was designed to irritate the
smallest number of people possible and cover the widest range of use
cases possible, and that we're now hearing from people in that "smallest
possible" group.
Those people have legitimate needs. How should they be accommodated,
and how does Unicode participate in that process? Seems there are a
number of options:
1) Change the default properties for some range of the PUA. This is
what people seem to be pushing hardest for. There are a number of
problems with this approach: a) How do you subdivide the range? There
are a lot of Unicode properties; are we to try to set aside a separate
range for each unique combination? If we don't, how do we avoid STILL
leaving somebody out in the cold? b) What about existing uses? If
there are a lot of user communities out there taking advantage of the
default behavior, and they're not all using the same part of the PUA,
changing the defaults will break their existing uses. This is okay as
far as the Unicode standard is concerned, but it won't be okay with real
users and real application vendors. Seems like even if Unicode changes
the default, you'd have a lot of application vendors who decide to leave
their PUA properties alone rather than break existing uses. Even if
this weren't a problem, there'd still be a lag while implementations
were changed.
2) Leave the current PUA alone, but set aside a new PUA, say Planes 12
and 13. This solves the existing-use problem, but you still have the
question of just how you subdivide the range, and it starts to cut down
significantly on the code points available for actual standardization.
3) Define ad-hoc standards that are based on Unicode but make character
assignments in the PUA and lobby application vendors to support these
encodings in addition to regular Unicode. Here, you'd still have an
implementation lag, and you're now defining some sort of parallel
standardization process. Better to just let the wheels of Unicode grind
their way toward real standardization and concentrate on real interim
solutions. This does seem like an option for things in case 4, although
I'm still not sure it's a good idea.
4) Lobby for operating-system vendors to extend their text engines to
allow properties of PUA code points to be configured. With some text
engines, a lot of the work is delegated to the font, and you can get
most of the effects you need by designing appropriate fonts. The big
case where this doesn't seem to be possible is OpenType. Maybe the user
communities who aren't being accommodated now should be lobbying for
things like pluggable OpenType renderers. This seems like a good idea,
but it's completely outside the scope of Unicode. This also doesn't
solve all the problems. Searching and sorting can be covered with
tailored collation weights on systems that let you specify collation
weights directly, but direct character property queries are tougher--
this stuff generally isn't configurable and making it configurable would
slow lots of things WAY down. But if you're mostly concerned with
rendering, these might be properties you can live without configuring.
5) Write specialized applications that are designed to deal with certain
scripts and address the needs of user communities whose needs aren't
being met right now. Of course it's better to use industry-standard
applications, but when they don't do the job, you write your own and
make do. This seems like something that could be done by the
open-source community.
6) Use markup or other fancy-text mechanisms to override the default
properties. There are plenty of controls for managing
directionality, cursive joining, and line breaking. It may be
inconvenient to use them, but it seems like a viable workaround while
waiting for something to get into Unicode, and there's no implementation
lag. What problems do the existing mechanisms not solve? Maybe the
discussion should focus on this question-- are there mechanisms that
should be added to Unicode or some markup language to help enable some
of these scripts?
7) Design custom fonts that cannibalize existing code points that have
the right sets of properties. This is a terrible long-term solution,
but seems acceptable for interim use while waiting for something to be
standardized and implementations to catch up.
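
To make option 6 a little more concrete, here is a minimal Python sketch of overriding the default left-to-right behavior of private-use characters with existing directional controls. It assumes a hypothetical right-to-left script that has been given private-use code points; the function name force_rtl_private_use is invented for this example, and whether the result displays acceptably depends on the text engine and font in use.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def force_rtl_private_use(text: str) -> str:
    """Wrap each maximal run of private-use characters in RLO ... PDF.

    Assumption for illustration: every private-use character in the text
    belongs to a right-to-left script, so its default bidi class (L)
    needs to be overridden.
    """
    out = []
    in_run = False
    for ch in text:
        private = unicodedata.category(ch) == "Co"
        if private and not in_run:
            out.append(RLO)   # open the override before the run
        elif not private and in_run:
            out.append(PDF)   # close the override after the run
        out.append(ch)
        in_run = private
    if in_run:
        out.append(PDF)       # close a run that reaches the end of the text
    return "".join(out)


if __name__ == "__main__":
    sample = "abc \uE000\uE001\uE002 def"
    print([f"U+{ord(c):04X}" for c in force_rtl_private_use(sample)])
```

The same pattern would apply to the joining and line-breaking controls (for example ZERO WIDTH JOINER, ZERO WIDTH NON-JOINER, and WORD JOINER), though where to insert them depends entirely on the script being represented.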
1 and 2 are the solutions Unicode itself could mostly implement, but they're both
problematic and I don't think they solve all the problems. 6 and 7 are
the solutions that are available now without any OS or application
vendors having to do a thing. They're ugly, but if we're talking mostly
about stuff that's destined to find its way into Unicode someday, why
won't they work as interim solutions, and why isn't the energy currently
being put into solutions 1 and 2 better put into actually standardizing
these characters? 4 seems desirable, but hard to pull off without a lot
of market strength (although there might be enough of a market to make 5
feasible). And 3 seems like a bad idea all around.
I'd like to see the yelling stop and see people focus on the actual
needs and on realistic solutions instead of just complaining about how
the UTC isn't fair. And to the degree that the problem isn't really
with Unicode at all (which is mostly), I'd like to see that discussion
happen somewhere else.
--Rich Gillam
Language Analysis Systems, Inc.