Peter Constable wrote:
> With or without the conventions and registry William is
> suggesting, the real issue still isn't addressed: in
> what form do I communicate to you what my PUA
> codepoints mean. [...]
> I'd rather just provide you with a database containing
> the semantics of my PUA codepoints in some format that
> we've agreed upon (or, at least, that I've documented).
I think that there are two separate issues in this:
1) How to describe the properties of my PUA characters;
2) How to specify that a certain text file uses my PUA convention rather
than someone else's.
For issue #1, why not document the semantics of PUA characters using the
same conventions and formats that the Unicode Consortium uses to document
the semantics of "normal" characters?
I.e., why not define your own deltas for the files in:
http://www.unicode.org/Public/UNIDATA/
Regardless of whether these conventions and formats are considered good or
bad, it is quite likely that all Unicode systems are prepared to process
them in one way or another (either directly or by converting them to some
native format).
By mimicking the official Unicode property files, designers of PUA character
sets would be able to provide most of the information that applications need
to know about their private characters, e.g.:
- The bidirectional category;
- Decomposition data (which might allow conversion to non-private code
points, if applicable);
- Case categorization and mapping;
- Numerical properties;
- Useful mnemonic names and comments for each character;
- Even minimal shaping specifications for joining scripts, as is done for
Arabic and Syriac.
Of course, this is not necessarily ALL the information needed, but a fairly
good part of it.
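
As a rough illustration of the idea, here is a minimal Python sketch that
reads a PUA "delta" written in the semicolon-separated layout of
UnicodeData.txt. The two sample records, their character names, and the
function names are made up for the example; only the field order is taken
from the real file.

```python
# Field order of a UnicodeData.txt record (15 semicolon-separated fields).
FIELDS = (
    "code", "name", "general_category", "combining_class", "bidi_category",
    "decomposition", "decimal_digit", "digit", "numeric", "mirrored",
    "unicode_1_name", "comment", "uppercase", "lowercase", "titlecase",
)

# Hypothetical delta: two invented private characters forming a case pair.
SAMPLE_DELTA = """\
E000;EXAMPLE PRIVATE CAPITAL LETTER A;Lu;0;L;;;;;N;;;;E001;
E001;EXAMPLE PRIVATE SMALL LETTER A;Ll;0;L;;;;;N;;;E000;;E000
"""


def is_private_use(cp):
    """True if cp lies in the BMP PUA or in the plane 15/16 private-use areas."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def parse_pua_delta(text):
    """Return {code point: record dict} for every record in the delta text."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        values = line.split(";")
        if len(values) != len(FIELDS):
            raise ValueError("expected %d fields: %r" % (len(FIELDS), line))
        record = dict(zip(FIELDS, values))
        cp = int(record["code"], 16)
        if not is_private_use(cp):
            raise ValueError("U+%04X is not a private-use code point" % cp)
        table[cp] = record
    return table


if __name__ == "__main__":
    for cp, rec in sorted(parse_pua_delta(SAMPLE_DELTA).items()):
        print("U+%04X %s bidi=%s" % (cp, rec["name"], rec["bidi_category"]))
```

Rejecting non-private code points is a design choice of this sketch: it
keeps a private delta from silently overriding the properties of standard
characters.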
One interesting addition could be a convention (a BDF font? a set of JPEGs?
etc.) to provide default glyphs to be used as a fallback presentation (OK,
it's crude, but not as crude as black boxes).
About issue #2 (specifying the PUA convention used by a particular document)
I had a crazy idea... I am sure that the list will take just a few minutes
to destroy it. :-)
I was wondering: why not use plane-14 tag characters? So far the only use
for those characters is to specify *language* in plain text, in combination
with a language-tagging prefix character:
* U-0E0001 (LANGUAGE TAG)
The wild idea is to add a tag prefix for specifying "PUA semantics" in plain
text:
* U-0E0002 (PUA INTERPRETATION TAG)
This prefix would be followed by a sequence of tag characters
(U-0E0020..U-0E007F) that specifies the meaning of the PUA characters used
from that point onwards.
I don't know exactly how this string of tag characters should be structured.
Perhaps the format could even be flexible enough to accommodate more than
one convention.
Assuming the convention above of using UniData-like files, the tag
characters could specify a URL for these files (or for the directory
containing them).
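
To make the mechanics concrete, here is a minimal Python sketch of how such
a tag could be written and read back. Note that U-0E0002 as a "PUA
interpretation" prefix exists only in this proposal, not in Unicode; the
function names and the example.org URL are likewise invented for the
illustration. The sketch maps ASCII 0x20..0x7E to U-0E0020..U-0E007E and
leaves out U-0E007F, which Unicode defines as CANCEL TAG.

```python
TAG_BASE = 0xE0000
# Hypothetical prefix proposed in this message; NOT an assigned Unicode character.
PUA_INTERPRETATION_TAG = "\U000E0002"


def to_tag_chars(ascii_text):
    """Map printable ASCII (0x20..0x7E) to the plane-14 tag characters."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError("not representable as a tag character: %r" % ch)
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def make_pua_tag(url):
    """Build the prefix followed by the URL spelled in tag characters."""
    return PUA_INTERPRETATION_TAG + to_tag_chars(url)


def find_pua_tags(text):
    """Return (offset of prefix, decoded URL) for each tag sequence in text."""
    results = []
    i = 0
    while i < len(text):
        if text[i] == PUA_INTERPRETATION_TAG:
            j = i + 1
            # Consume the run of tag characters that follows the prefix.
            while j < len(text) and 0xE0020 <= ord(text[j]) <= 0xE007E:
                j += 1
            url = "".join(chr(ord(c) - TAG_BASE) for c in text[i + 1:j])
            results.append((i, url))
            i = j
        else:
            i += 1
    return results


if __name__ == "__main__":
    url = "http://example.org/pua/"  # placeholder location of the property files
    document = make_pua_tag(url) + "\uE000\uE001 some private text"
    print(find_pua_tags(document))
```

A reader that does not know the convention would simply ignore the tag
characters, which is the same behaviour expected for language tags.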
Comments?
_ Marco