Generic Tagging: A Modest Proposal

From: John Cowan (john_cowan@hotmail.com)
Date: Mon Jul 14 1997 - 17:43:48 EDT


Due to network problems, I can read mail at cowan@ccil.org but
cannot post, reply, or send from there. Please direct all private
replies to cowan@ccil.org, not the HotMail address. Thanks.

The following Modest Proposal, Version 0.1, is intended to provide
a general solution to the recurring demands for embedded tags
about this, that, and the other (language, CJK script, etc. etc.).

The general idea is to encode tags in a highly restricted Latin
alphabet, the same one used for Unicode character names,
augmented by a *path separator* character that allows the construction
of a hierarchy of tags. In addition, tags begin with one of two
characters which delimit consecutive tags and distinguish
private-use tags from tags created within a standard global tag
hierarchy.

Here are the characters, which occupy 3 columns of the BMP. For
concreteness, I am using the last three columns of the CJK
Compatibility Ideographs block, FAD0-FAFF (there are no assigned
codepoints in those columns), because it seems to me that these
characters naturally belong in the Compatibility Area/R-Zone, since
they are provided for compatibility with industry practices (to
wit, inline tags). They could also go in the Specials block, except
that there are too many of them. But this placement is not an
essential point: anywhere on the BMP would do.

FAD0 META PUBLIC TAG
FAD1 META PRIVATE TAG
FAD2 META PATH SEPARATOR
FAD3 META SPACE
FAD4 META LETTER A
...
FAED META LETTER Z
FAEE META LETTER HYPHEN-MINUS
FAEF unused
FAF0 META DIGIT ZERO
...
FAF9 META DIGIT NINE
FAFA unused
FAFB unused
FAFC unused
FAFD unused
FAFE unused
FAFF unused
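
As an illustration only, the chart above can be written down as a small
mapping from the ASCII repertoire to the proposed META characters. This
is a minimal sketch: the constant names and the function to_meta are
hypothetical, and the code points are just the provisional ones listed
above.

```python
# Provisional code points copied from the chart above (not assigned anywhere).
META_PUBLIC_TAG = 0xFAD0
META_PRIVATE_TAG = 0xFAD1
META_PATH_SEPARATOR = 0xFAD2
META_SPACE = 0xFAD3
META_LETTER_A = 0xFAD4
META_HYPHEN_MINUS = 0xFAEE
META_DIGIT_ZERO = 0xFAF0


def to_meta(ch: str) -> str:
    """Map one ASCII character to its proposed META counterpart."""
    if "A" <= ch <= "Z":
        return chr(META_LETTER_A + ord(ch) - ord("A"))
    if "0" <= ch <= "9":
        return chr(META_DIGIT_ZERO + ord(ch) - ord("0"))
    if ch == " ":
        return chr(META_SPACE)
    if ch == "-":
        return chr(META_HYPHEN_MINUS)
    raise ValueError(f"no META counterpart for {ch!r}")


print(hex(ord(to_meta("Z"))))  # 0xfaed
```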

These characters have uniform properties, as follows:

Category: Cf (Other, Format)
   or a new category Cm (Other, Meta)
Combining: 0
BIDI: Other Neutral
Mirrored: No

Because these codepoints don't overlap with regular ones, tag
stripping is a matter of the usual Unicode rule of "ignore and
pass through what you do not understand." Unlike Escape or
Control sequences, no special stripping logic is required.
Because the number of codepoints is small, they can all be
on the BMP.
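
To make the point concrete, here is a rough sketch of tag stripping
under the provisional FAD0-FAFF placement. The function name strip_meta
is made up for this example; treating the unused code points in the
range as strippable is my assumption, not part of the proposal.

```python
def strip_meta(text: str) -> str:
    """Drop every character in the provisional META range FAD0-FAFF."""
    return "".join(ch for ch in text if not 0xFAD0 <= ord(ch) <= 0xFAFF)


# "*C" (FAD1 FAD6) embedded in ordinary text disappears; the rest is untouched.
print(strip_meta("ab\uFAD1\uFAD6cd"))  # abcd
```

No state is carried from one character to the next, which is the
contrast with Escape or Control sequences.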

For the purposes of this document, the space, letters, and digits
will be represented by their ASCII equivalents, the META PATH
SEPARATOR by +, the META PUBLIC TAG by #, and the META PRIVATE
TAG by *. In glyph charts, these characters could be used
enclosed in dotted boxes (because META characters are normally
non-printing).

A tag, under this scheme, is a consecutive run of META characters
beginning with either a META PUBLIC TAG or a META PRIVATE TAG.
Private tags, therefore, can be as short as two Unicode characters:
private tags for "C", "J", and "K" could be encoded as "*C" (FAD1+FAD6),
"*J" (FAD1+FADD), and "*K" (FAD1+FADE). Private tags
can only be used in the presence of an agreement about their meanings.

For robustness, a tag character found with no preceding public
or private tag character should be treated as the beginning of a private
tag. Relying on this behavior is unwise, though, because without the
leading tag character it is impossible to tell where one private tag
stops and another begins.
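
The tag definition and the robustness rule can be sketched as a small
scanner. This is only one possible reading: the names find_tags, TAG,
and BODY are hypothetical, the code points are the provisional ones
from the chart, and I assume that any META character other than the
two tag characters may appear in a tag body.

```python
import re

PUBLIC, PRIVATE = "\uFAD0", "\uFAD1"
# Body characters: path separator, space, letters, hyphen-minus, digits.
BODY = "[\uFAD2-\uFAEE\uFAF0-\uFAF9]"
# Either a tag character followed by a body, or a bare run of body characters.
TAG = re.compile(f"[{PUBLIC}{PRIVATE}]{BODY}*|{BODY}+")


def find_tags(text):
    """Return (kind, body) pairs for each tag found in text."""
    tags = []
    for m in TAG.finditer(text):
        run = m.group()
        if run[0] == PUBLIC:
            tags.append(("public", run[1:]))
        elif run[0] == PRIVATE:
            tags.append(("private", run[1:]))
        else:
            # No leading tag character: fall back to a private tag.
            tags.append(("private", run))
    return tags


# "*C" followed directly by "*J" yields two separate private tags.
print(len(find_tags("\uFAD1\uFAD6\uFAD1\uFADD")))  # 2
```

Without the leading FAD1 characters, the same two letters would come
back as a single private tag, which is the ambiguity described above.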

Public tags begin with the META PUBLIC TAG character, and
contain runs of META LETTERs, META DIGITs, or META SPACEs. These
runs are called "tag components", and are separated by single META
PATH SEPARATOR characters. (Different subcultures may think of META
PATH SEPARATOR as a slash, a backslash, a colon, or a full stop.)
Text processes that understand some public tags but not others
can ignore and pass through those they do not understand without
fear of confusion.
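
As a sketch of how a text process might decide which public tags it
understands, the following splits a tag into components and compares
them against a list of known prefixes. It uses the ASCII notation of
this document (# and +); the function names and the sample tags are
invented for illustration.

```python
def components(tag):
    """Split a public tag in ASCII notation into its tag components."""
    if not tag.startswith("#"):
        raise ValueError("public tags begin with #")
    return tag[1:].split("+")


def is_understood(tag, known_prefixes):
    """True if the tag lies at or below one of the known prefixes."""
    parts = components(tag)
    return any(
        parts[: len(prefix)] == prefix
        for prefix in map(components, known_prefixes)
    )


known = ["#COM+APPLE"]  # hypothetical: a process that knows one subtree
print(is_understood("#COM+APPLE+FONT", known))   # True
print(is_understood("#COM+APPLEX+FONT", known))  # False
```

Comparing whole components rather than raw string prefixes keeps
"#COM+APPLEX" from being mistaken for something under "#COM+APPLE".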

Each public tag belongs to a hierarchy similar to that used
in most file systems or by the DNS (except that, unlike a DNS name,
a tag puts its most significant component on the left, the so-called
"big-endian" convention).
An organization with a registered DNS domain can create public tags
from the hierarchy represented by reversing its domain name.

Thus Apple Computer tags would begin with "#COM+APPLE+".
This is essentially the same rule promulgated by JavaSoft for
naming Java packages. ISO-specified tags might look something like
"#CH+ISO+10xxx+whatever", where 10xxx is the standard specifying
the tags.
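
The domain-reversal rule is simple enough to sketch in a few lines.
The function name public_tag_prefix is hypothetical, and the output is
in the ASCII notation of this document (# for META PUBLIC TAG, + for
META PATH SEPARATOR), not in real META characters.

```python
def public_tag_prefix(domain: str) -> str:
    """Reverse a DNS name into a public-tag prefix in ASCII notation."""
    labels = domain.upper().split(".")
    return "#" + "+".join(reversed(labels)) + "+"


print(public_tag_prefix("apple.com"))  # #COM+APPLE+
print(public_tag_prefix("iso.ch"))     # #CH+ISO+
```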

Public tags are more verbose than private ones, but as long as the
rules are followed, different users of tags will not step on
one another's toes.

Sorting tags in codepoint order will produce a top-down traversal of
the tag hierarchy. It would have been nice to make the META
characters sort in the same way as their ASCII analogues, but this
cannot be done, because META SPACE must sort after META PATH
SEPARATOR.
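
A small experiment shows the difference. The sketch below sorts a few
made-up tags, written in the ASCII notation of this document, first by
the provisional META code points and then by plain ASCII order. The
names TABLE and meta_key are hypothetical.

```python
import string

# ASCII stand-ins used in this document -> provisional META code points.
TABLE = {"#": 0xFAD0, "*": 0xFAD1, "+": 0xFAD2, " ": 0xFAD3, "-": 0xFAEE}
TABLE.update({c: 0xFAD4 + i for i, c in enumerate(string.ascii_uppercase)})
TABLE.update({c: 0xFAF0 + i for i, c in enumerate(string.digits)})


def meta_key(tag):
    """Sort key: the META code points behind an ASCII-notation tag."""
    return [TABLE[ch] for ch in tag]


tags = ["#A B+C", "#A+B", "#A", "#A B"]
print(sorted(tags, key=meta_key))  # ['#A', '#A+B', '#A B', '#A B+C']
print(sorted(tags))                # ['#A', '#A B', '#A B+C', '#A+B']
```

In META order the child "#A+B" follows its parent "#A" directly. In
ASCII order the unrelated "#A B" subtree lands between them, because
the ASCII space sorts before "+".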

John Cowan cowan@ccil.org
        Please do not use "Reply"
        e'osai ko sarji la lojban.