RE: Defined Private Use was: SSP default ignorable characters

From: Language Analysis Systems, Inc. Unicode list reader (Unicode-mail@las-inc.com)
Date: Wed Apr 28 2004 - 11:14:12 EDT

Next message: Peter Constable: "RE: Romanian and Cyrillic"

Previous message: Peter Constable: "RE: Croatian"
Maybe in reply to: Ernest Cline: "Defined Private Use was: SSP default ignorable characters"
Next in thread: Peter Kirk: "Re: Defined Private Use was: SSP default ignorable characters"
Reply: Peter Kirk: "Re: Defined Private Use was: SSP default ignorable characters"
Reply: Mark E. Shoulson: "Re: Defined Private Use was: SSP default ignorable characters"

>This is most easily and most naturally controlled by the end users of
>such introspective setups - simply do not allow conflicting PUA code
>points on their systems. In such a scenario, the operating system is
>not forced to make decisions.

That seems unduly limiting. If I want to write a document in two
scripts, each of which is supported by only one font, both of which use
the same code point range for their characters, I'm stuck.

>But if there were multiple PUA fonts with competing code points on a
>single system, I suggest that OS'es should simply go with the first
>font (defining "first" as that font whose file name has the lowest
>Unicode-code-point-based "alphabetic" order).

If the principle behind the PUA is that anyone can do anything with
these code points, standardizing anything about them erodes that
principle and potentially breaks existing software making use of the
PUA.
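
For concreteness, the quoted tie-breaking rule could be sketched roughly
as below. This is only an illustration: the function name and the font
file names are made up, and nothing here corresponds to an existing OS
API.

```python
def pick_first_font(font_file_names):
    """Return the file name that sorts lowest in code point order."""
    # Python compares str values code point by code point, which is one
    # reading of the "Unicode-code-point-based alphabetic order" in the
    # quoted proposal.
    return min(font_file_names)


# Hypothetical fonts that all claim the same PUA range.
candidates = ["tengwar.ttf", "Klingon.ttf", "ælfric.ttf"]
print(pick_first_font(candidates))  # prints "Klingon.ttf"
```

Note that the winner is decided by things like the case of the first
letter of the file name, which has nothing to do with which script the
document was actually written in.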

There's been a lot of discussion of the PUA in this forum over the time
I've been on it, but I don't think I've heard anyone make the following
point:

If you're using the PUA outside a closed system, you're not using
Unicode.

The PUA is intended for the internal use of applications (or groups of
applications), or for interchange between applications by private
agreement of all parties involved. Writing a document in Microsoft Word
using some exotic script that doesn't have plain-vanilla behavior
violates this because Microsoft Word isn't a party to the private
agreement. You have to write software yourself that does the right
thing with your characters (you don't have to rewrite Windows, but you
might have to rewrite Word, which I agree isn't really any more
realistic).

Therefore, if you're using the PUA out in the "wild" and expecting free
interchange, you're not using Unicode anymore; you're using a separate
encoding _based_ on Unicode. In many respects, it's identical to
Unicode, but it's a separate encoding because it applies additional
semantics to code points whose definition Unicode leaves open. It seems
to me that if you want to ensure that documents that make use of the PUA
are interpreted properly by, say, someone who downloads them from the
Web, you have to tag their encoding as something other than Unicode, and
if you want OS vendors to support particular semantics for PUA code
points, you have to ask them to support this other encoding that gives
those code points those semantics.
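
To illustrate what "leaves open" means in practice: a receiving
application can tell that a piece of text contains private-use code
points, but nothing in Unicode tells it what they are supposed to mean.
A minimal Python sketch, assuming only the standard unicodedata module
(the function name is hypothetical):

```python
import unicodedata


def private_use_code_points(text):
    """Return the private-use code points in text, in order of first appearance."""
    found = []
    for ch in text:
        # General category "Co" is Unicode's Private Use category.
        if unicodedata.category(ch) == "Co" and ord(ch) not in found:
            found.append(ord(ch))
    return found


# One BMP private-use character (repeated) and one from Plane 15.
sample = "abc\uE000\U000F0000\uE000"
print([f"U+{cp:04X}" for cp in private_use_code_points(sample)])
# prints ['U+E000', 'U+F0000']
```

The check can flag such a document, but interpreting the flagged code
points still requires the out-of-band agreement (or the separate
encoding label) discussed above.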

Of course, if you're going to try to standardize a use of the PUA, it
seems to make just as much sense to standardize the actual characters in
Unicode in the normal way. If we have a bunch of different
Unicode-derived encodings out there, that basically resurrects the
problem Unicode was designed to solve. But I'm beginning to think this
is already happening in some places.

Using Plane 14 tag characters to identify particular uses of the PUA
seems very akin to the old ISO 2022 code-switching scheme, and I
_really_ don't think we want to go there again.

In any event, imposing semantics on PUA code points in documents out in
the "wild" isn't a "private use," and therefore documents and
applications doing this are using an ad-hoc Unicode-derived encoding,
not Unicode. It should be dealt with as such, rather than trying to
turn Unicode into ISO 2022.

--Rich Gillam
Language Analysis Systems, Inc.


This archive was generated by hypermail 2.1.5 : Wed Apr 28 2004 - 12:18:01 EDT