Re: An attempt to focus the PUA discussion [long]

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 01 2004 - 10:07:52 CST


From: "Ernest Cline" <ernestcline@mindspring.com>
> However, this does point out the desirability of having codepoints that
> while their semantics are defined privately, their property values are defined
> by Unicode. This is because while an implementation conforming to a
> particular
> Private Use should be able to normalize according to that Private Use, doing
> so makes the data unfit to be examined by any tools not aware of that
> Private Use. This complicates the task for Private Uses, as they essentially
> have to not only reinvent the wheel, but the lever, pulley, and screw as well.
>
> If it could be made possible to have standard tools following the standard
> Unicode rules while using the standard default Unicode properties, it would
> make it easier to accommodate Private Uses.

Doesn't Unicode already provide a syntax for its standard properties files,
one that could be used to create private annexes to the standard, specifying
private properties for PUA code points?

If so, there is not a lot to reinvent, provided an application can already
process the standard Unicode properties files: a private convention is just an
additional set of files to merge with the standard ones (which are typically
precompiled for performance).
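
As a minimal sketch of that merging step (the file names, the sample PUA
record, and the override policy are my own assumptions, not an existing tool),
an application that already parses the semicolon-separated UnicodeData.txt
record format could load a private annex written in the same format and let
its records supplement or override the standard ones for PUA code points:

    import java.io.*;
    import java.util.*;

    // Sketch: merge a private, UnicodeData-style annex over the standard table.
    // The private file reuses the standard 15-field record layout, e.g.:
    //   E000;MY PRIVATE LETTER ONE;Lo;0;L;;;;;N;;;;;
    public class PropertyMerge {
        // code point -> the 15 semicolon-separated property fields
        static void load(String file, Map<Integer, String[]> into) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() == 0 || line.startsWith("#")) continue;
                String[] fields = line.split(";", -1);   // keep trailing empty fields
                into.put(Integer.parseInt(fields[0], 16), fields);
            }
            in.close();
        }

        public static void main(String[] args) throws IOException {
            Map<Integer, String[]> props = new HashMap<Integer, String[]>();
            load("UnicodeData.txt", props);      // standard properties
            load("MyPuaConvention.txt", props);  // private annex overrides/extends them
            String[] e000 = props.get(0xE000);
            System.out.println("U+E000 gc=" + e000[2] + " bidi=" + e000[4]);
        }
    }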

Such an extension mechanism is already used and documented for collation, when
tailoring the DUCET for a particular language. Unicode had to describe
collation tailoring because it involves standard characters, not only PUAs.

Looking at ICU, for example, concrete extension syntaxes already let users
edit, compile, and use their own collation tables at run time. Java integrates
a similar facility for collation.
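
For instance, with the java.text collation classes one can append a private
tailoring to the default rules at run time; here U+E000 is just an arbitrary
PUA placeholder and the rule is only an illustrative sketch:

    import java.text.Collator;
    import java.text.ParseException;
    import java.text.RuleBasedCollator;

    // Sketch: tailor the default collation rules so that a PUA character
    // (U+E000, an arbitrary example) sorts immediately after 'z'.
    public class PuaCollation {
        public static void main(String[] args) throws ParseException {
            RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance();
            RuleBasedCollator tailored =
                new RuleBasedCollator(base.getRules() + " & z < \uE000");
            // negative result: "z" still sorts before the PUA character
            System.out.println(tailored.compare("z", "\uE000"));
        }
    }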

With LDML, a new XML-based syntax has been introduced for collation rules.
Nothing prevents defining similar XML-based character property tables to
represent the standard properties files together with tailored PUA properties.

So a set of private conventions for PUAs could likewise be represented with
similar syntaxes, using a defined document format or XML schema. Such a formal
definition could then get its own URI, used to tag documents and fonts whose
PUA usage conforms to these private definitions.

Such a tool could also be used to create testbeds for future extensions to the
standard: for now, proposed extensions are described using unassigned code
points, but without a working environment in which their properties can be
experimented with. With such a tool, newly proposed characters could be tested
and used as PUAs while waiting for standardization. If a document using them
is correctly tagged with the versioned PUA convention URI, there will be no
ambiguity later when converting the experimental PUA-based encoding to the new
standardized characters (if they are accepted and included in a later version
of the standard).

Also, a Unicode implementation based on an older precompiled version of the
UCD could use this extension mechanism to emulate newer standardized
characters, while keeping full compatibility with the previous version of the
standard.

Say that Unicode and ISO/IEC 10646 agree to assign the CEDI currency sign at
U+20B2 in Unicode 4.1, but that an implementation today is limited to the
repertoire of Unicode 4.0. U+20B2 may not be accepted or handled correctly by
that implementation, but it could be handled correctly if the implementation
supports an extension mechanism that allows importing the Unicode 4.1
properties files. This could be done in two ways:
- Without re-encoding: this would require allowing the Unicode implementation
to accept properties for unassigned (reserved) code points. This may be a
security issue, unless the definition file is signed to authorize changing the
normative properties of non-PUA code points.
- With re-encoding: if the document is encoded, for backward compatibility,
with a PUA code point such as U+F000, and tagged with a URI indicating that
this PUA corresponds to the definition of an extended repertoire (identified
by that URI) that maps its properties correctly, then the existing
implementation will be able to use the PUA code point for the CEDI character
directly, without even knowing that it was later standardized as U+20B2.

In the second case above, a "gateway" interface may provide the encoding
conversion, using a local repository of accepted PUA conventions. This second
case is the one that would be used to create a testbed, including font support
for the proposed new character.
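
As a minimal sketch of such a gateway (the convention URI, the class and
method names, and the single U+F000 -> U+20B2 mapping just reuse the
hypothetical CEDI example above; a real repository would hold one table per
accepted convention):

    import java.util.*;

    // Sketch: a local repository of accepted PUA conventions, keyed by their
    // URI, each mapping PUA code points to later-standardized code points.
    public class PuaGateway {
        private final Map<String, Map<Integer, Integer>> conventions =
            new HashMap<String, Map<Integer, Integer>>();

        public void register(String conventionUri, Map<Integer, Integer> mapping) {
            conventions.put(conventionUri, mapping);
        }

        // Re-encode a document tagged with a convention URI to the standard repertoire.
        public String toStandard(String conventionUri, String text) {
            Map<Integer, Integer> map = conventions.get(conventionUri);
            if (map == null) return text;            // convention not accepted locally
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                Integer std = map.get(cp);
                out.appendCodePoint(std != null ? std.intValue() : cp);
                i += Character.charCount(cp);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            PuaGateway gw = new PuaGateway();
            // Hypothetical convention: U+F000 stands for the proposed CEDI sign,
            // later standardized (in this example) at U+20B2.
            gw.register("http://example.org/pua/cedi-1.0",
                        Collections.singletonMap(0xF000, 0x20B2));
            System.out.println(gw.toStandard("http://example.org/pua/cedi-1.0",
                                             "price: \uF000 25"));
        }
    }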

More generally, testbed implementations have their advantages: they don't
require premature standardization of normative character properties that might
later prove problematic. A proposal would pass the acceptance filter once it
starts being used properly under the described private convention. Users would
still be free to accept or reject these conventions in their local repository,
or to use an alternate definition.

Some scripts will never move beyond experimental status: documents and fonts
will start being created, but the convention will later be superseded and the
former private convention abandoned. Users of a private convention always run
the risk that it will not reach the standardization stage where its characters
are accepted.

Researchers will then be able to create their own experimental scripts for
their books and publications, and to discard them once the book is finalized
and published. Authors will be able to customize their documents as they wish,
getting the benefits of the large standard Unicode repertoire plus a very
small set of private characters and conventions that they can create
themselves.

These testbeds could be used to demonstrate that a proposed set of characters,
or modified properties, supports and resolves more linguistic, phonetic,
semantic, and orthographic problems for actual languages than is possible
without that standardization.

This way of working would be more open and more modern than hacking existing
encodings or creating new ones. If a private convention starts gaining some
acceptance within a community of users, those users could register their PUA
convention URI in the IANA charset registry, deprecating the old approach of
legacy 8-bit charsets with poorly formalized descriptions of their properties.

And finally, Unicode and ISO/IEC 10646 could consider some of those registered
PUA conventions and compare their respective merits, in order to create a
standard extension matching actual use by real communities, while also
exposing possible contradictions between these conventions.

Then, once a newer Unicode version is issued with the new characters, these
communities could update their conventions by annexing a mapping to the newly
standardized repertoire. Users in these communities would have a clear,
established model for the transition: they could use a transcoder service to
re-encode their documents and data to the newer standard.

For Unicode or ISO/IEC 10646, this process would not look different from
working with legacy 8-bit charsets, except that the task would become easier:
PUA conventions would have more formal definitions, existing implementations,
and an established open processing model to support them.

This way we won't be limited by a strictly "private" status for PUAs. What
remains private is the PUA code point assignment used in the context of a
convention, but each pair (convention + PUA code point) becomes interchangeable
in a limited but useful way. It is very difficult to keep a "private" usage
completely private: texts are created first of all to be interchanged between
users, there is no upper bound on the number of users, and no strict fence
prevents a private convention from being reused by a larger community. One
good example is the ConScript registry, which already represents a significant
community of users interchanging data using those PUAs.


