Re: Towards a classification system for uses of the Private Use Area

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Sat Apr 27 2002 - 06:39:07 EDT


Many thanks to those who have replied to my posting.

I wonder if I may add further information in the hope of satisfactorily
responding to the comments made and also in the hope of encouraging interest
in the idea of the classification system.

The classification system is intended to be quite straightforward to use.
It does not necessarily need to be taken up by large manufacturers at all in
order to be successful. As far as I can tell, a few software tools such as
a competent programmer might fairly easily produce are all that would be
needed in order to get the system into use.

I am presently aware of four uses, or intended uses, of the Private Use
Area. There may well be others, of which I am interested to learn.

The four of which I am aware are as follows.

The ConScript registry.

Development of a character set for Egyptian hieroglyphics.

Development of a character set for Cuneiform tablets.

The eutocode system, which is part of my own research, which is mentioned in
the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) section
at http://www.users.globalnet.co.uk/~ngo which is our family webspace in
England.

Within my classification system, suppose please that someone developing the
character codes for Egyptian hieroglyphics requests that he or she be
assigned type tray 3001 and that someone developing character codes for
cuneiform requests the assignment of type tray C001 and that I request type
tray E001 for the eutocode system, and that all of these requests are
granted.

Then, in order to apply the classification system to any plain text file,
the file needs to contain some classification characters near the start.

For a file using the Egyptian hieroglyphics characters, the following
sequence would be needed.

U+F35B U+F333 U+F330 U+F330 U+F331 U+F35D

For a file using the cuneiform characters, the following sequence would be
needed.

U+F35B U+F343 U+F330 U+F330 U+F331 U+F35D

For a file using eutocode, the following sequence would be needed.

U+F35B U+F345 U+F330 U+F330 U+F331 U+F35D
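
As an illustration only, the three sequences above are consistent with a
simple rule: take the text "[" + type tray code + "]" and add 0xF300 to the
ASCII code of each character. That rule is an assumption inferred from the
examples, as is the restriction to four hexadecimal digits; the authoritative
list of the eighteen characters is in the previous posting. The function
names are hypothetical. A minimal sketch in Python:

```python
# Assumption: each classification character is U+F300 plus the ASCII code of
# '[', ']' or an uppercase hexadecimal digit. This is inferred from the three
# example sequences, not taken from a published specification.
PUA_BASE = 0xF300
HEX_DIGITS = set("0123456789ABCDEF")


def classification_sequence(tray: str) -> str:
    """Return the six classification characters for a type tray code."""
    tray = tray.upper()
    if len(tray) != 4 or not set(tray) <= HEX_DIGITS:
        raise ValueError("type tray code must be four hexadecimal digits")
    return "".join(chr(PUA_BASE + ord(c)) for c in "[" + tray + "]")


def as_code_points(text: str) -> str:
    """Format a string in U+XXXX notation for checking by eye."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


print(as_code_points(classification_sequence("C001")))
# prints: U+F35B U+F343 U+F330 U+F330 U+F331 U+F35D
```

Under that assumed rule, the codes 3001 and E001 give the other two
sequences listed above.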

Suppose then that one day someone comes across a plain text file and within
that plain text file are character codes from the Private Use Area and that
person has no idea as to which character set those character codes may be
intended to represent.

So, the person looks at that file using a word processing program and
chooses to use a specially made fount named findpuac.ttf (that is, the find
private use area classification fount). That fount has all characters as
zero width except for the eighteen characters in the U+F3.. block which I
mentioned in my previous posting; those eighteen characters have the
analysis glyphs detailed in my previous posting. The screen display gives a
code of C001, which the person can look up in a web based reference list,
there finding out that it denotes a particular character set for cuneiform
characters. The web based reference list contains a link to a website from
which the person downloads a copy of a special fount that contains the
cuneiform characters. The plain text file is then displayed using that
fount. In that fount the eighteen characters in the U+F3.. block are of
zero width, so that they do not affect the display at all when the file is
displayed.
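
The same lookup could also be done by a small software tool rather than by
eye with the findpuac.ttf fount. The sketch below is not part of the
proposal as stated. It assumes that each classification character is U+F300
plus the ASCII code of "[", "]" or a hexadecimal digit (inferred from the
example sequences), and it uses a hypothetical REGISTRY dictionary to stand
in for the web based reference list, filled with the supposed assignments
from this posting.

```python
import re
from typing import List

PUA_BASE = 0xF300  # assumed offset, inferred from the example sequences

# The sixteen digit characters, assumed to be U+F330..U+F339 and U+F341..U+F346.
_DIGITS = "".join(chr(PUA_BASE + ord(c)) for c in "0123456789ABCDEF")
_OPEN = chr(PUA_BASE + ord("["))   # U+F35B
_CLOSE = chr(PUA_BASE + ord("]"))  # U+F35D
_PATTERN = re.compile(_OPEN + "([" + _DIGITS + "]{4})" + _CLOSE)

# Hypothetical stand-in for the web based reference list.
REGISTRY = {
    "3001": "Egyptian hieroglyphics",
    "C001": "cuneiform",
    "E001": "eutocode",
}


def find_type_trays(text: str) -> List[str]:
    """Return every type tray code found in the text, in order."""
    return [
        "".join(chr(ord(c) - PUA_BASE) for c in match.group(1))
        for match in _PATTERN.finditer(text)
    ]


sample = "\uF35B\uF343\uF330\uF330\uF331\uF35D" + "rest of the file"
for tray in find_type_trays(sample):
    print(tray, REGISTRY.get(tray, "not in the reference list"))
# prints: C001 cuneiform
```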

So, I suggest that the system is not too complex at all to implement and
use. The findpuac.ttf fount would be needed. I do not have the knowledge
or facilities to make that fount at present. However, there are people who
have fount generation packages and can make founts with characters in the
Private Use Area. So, although making a findpuac.ttf fount is work that
would take time and may cost money, it does not seem a prohibitively
difficult goal to achieve.

For the people producing the founts of Egyptian hieroglyphics characters and
cuneiform characters, there is the additional work of giving the eighteen
characters in the U+F3.. block zero width glyphs, so that those eighteen
characters do not display as rectangles indicating an unknown character. I
do not know how to do that at present, yet it does not seem a prohibitively
difficult thing to get done.

For people producing a plain text file that includes the six character
sequence of classification codes there could be some problems in entering
those characters. These could be solved in practice by one of a number of
methods. One method would be to start each user generated file by editing a
file that contains the six characters followed by an asterisk, so that the
typist could add the Egyptian hieroglyphics or cuneiform characters after
the asterisk, then delete the asterisk character using one push of the
backspace key. Another method would be to have a software tool that takes
an ordinary plain text file as input and produces a plain text file that is
six characters longer, having had the six classification characters added
in at the start of the new file.
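
The second method, a tool that adds the six characters at the start of a
file, might look like the following sketch. Several things here are
assumptions rather than part of the proposal: the rule that each
classification character is U+F300 plus an ASCII code (inferred from the
example sequences), the use of UTF-8 as the file encoding, and the function
names.

```python
PUA_BASE = 0xF300  # assumed offset, inferred from the example sequences


def classification_sequence(tray):
    """Return '[' + tray + ']' shifted into the U+F3.. block (assumed rule)."""
    return "".join(chr(PUA_BASE + ord(c)) for c in "[" + tray.upper() + "]")


def add_classification(src_path, dst_path, tray, encoding="utf-8"):
    """Write a copy of src_path that starts with the six classification
    characters for the given type tray code."""
    # newline="" leaves the original line endings exactly as they were.
    with open(src_path, encoding=encoding, newline="") as src:
        body = src.read()
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(classification_sequence(tray) + body)
```

A call such as add_classification("tablet.txt", "tablet-classified.txt",
"C001"), with hypothetical file names, would produce an output file that is
six characters longer than the input.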

I suggest that, in an environment where plain text files for various
projects with various character sets are being used, this classification
system is straightforward, relatively easy to get implemented and
potentially very useful. Please bear in mind that the production of the
findpuac.ttf fount is a process that need only be done once by someone
somewhere. Maybe someone reading this posting who has the necessary
facilities might even make one and make it available: all of the
information that is needed about the eighteen characters is included in
these two postings. Once the findpuac.ttf fount were made, a word
processor package such as Microsoft Word could be used to carry out the
analysis on a PC.

This classification system also allows characters from two different Private
Use Area character sets to be included in the same file. For example, for
research on cuneiform tablets, a cuneiform character set could be used to
display characters, while characters from eutocode could be used to store
three-dimensional measurement information (to 20 bits or 30 bits precision,
or whatever is required) of the character's physical shape in a particular
tablet of clay that is under consideration. The eutocode characters used
would be the data entry characters (U+EC00 through to U+EFFF each load
their ten least significant bits into an accumulator of a calculating
engine) and the data movement and marshalling characters (some codes from
U+EB00 through to U+EBFF are used for such purposes). Naturally, if the
cuneiform characters are promoted then using two different uses of the
Private Use Area in the same plain text file may no longer be necessary,
yet the classification scheme would perhaps have served a valuable purpose
on an interim basis. This is important. It is often the case that, in
order to get things achieved at a finished level, one needs to carry out
development as best one can with the facilities that one has available. In
hindsight, early experiments can look primitive, yet they are often an
important step without which the finished result could not, or would not,
have been achieved.
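
On the eutocode data entry characters mentioned above: the range U+EC00
through to U+EFFF contains exactly 1024 code points, which is why each
character can carry ten bits. The sketch below shows only that arithmetic.
The function names are hypothetical, and how successive ten bit loads are
combined into a 20 bit or 30 bit value is not described in this posting, so
it is deliberately left out.

```python
DATA_ENTRY_FIRST = 0xEC00  # start of the data entry range given in the text
DATA_ENTRY_LAST = 0xEFFF   # end of the data entry range given in the text


def data_entry_payload(ch: str) -> int:
    """Return the ten least significant bits of a data entry character."""
    code_point = ord(ch)
    if not DATA_ENTRY_FIRST <= code_point <= DATA_ENTRY_LAST:
        raise ValueError("not a data entry character")
    return code_point & 0x3FF


def data_entry_char(value: int) -> str:
    """Return the data entry character that carries a ten bit value."""
    if not 0 <= value <= 0x3FF:
        raise ValueError("value must fit in ten bits")
    # The low ten bits of 0xEC00 are all zero, so OR places the value there.
    return chr(DATA_ENTRY_FIRST | value)


print(data_entry_payload("\uEFFF"))          # prints: 1023
print(f"U+{ord(data_entry_char(5)):04X}")    # prints: U+EC05
```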

Whilst I like to hope that many people will wish to use this classification
system, in fact I do not need the agreement of anyone in order to get the
system started as the Unicode specification includes, in relation to the
Private Use Area, an implied right to publish character assignments.
Publication does not need the agreement of anyone. However, I feel that it
is important to try to gain a consensus and thus invite readers to comment
on the idea.

I am hopeful that this classification system will become widely used and
will provide a useful optional additional facility for people who make use
of the Private Use Area.

William Overington

27 April 2002


