Re: Towards a classification system for uses of the Private Use Area

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Mon Apr 29 2002 - 11:05:17 EDT


Michael Everson raises some interesting matters.

>>Development of a character set for Egyptian hieroglyphics.
>
>I'm involved with this and there's been no talk about using the PUA.
>This is because existing 8-bit systems comprising a superset of what
>is likely to be encoded can easily be transcoded directly to Unicode
>without need for testing in the PUA. Because there are existing
>implementation and all that's needed is a set of (probably rich text)
>mapping tables.

My reason for including Egyptian hieroglyphics was that I seemed to remember
something in one of the threads in this discussion forum about Egyptian
being encoded with about 800 characters to start. It may have been at that
time, in the same or another thread, that someone mentioned using the
Private Use Area for testing out characters before they are added into
regular Unicode.

Mention was made about continuing the discussion in a specialised discussion
group at Unicode and I tried to join that discussion list. I could not find
it on the webspace and so I enquired. It appears that I would most likely
have needed existing expertise in the language in order to join the
discussion group and, since I did not have such expertise and was simply
interested in learning about the typography as a matter of general
education, I decided not to apply. I mention this because it has been a
feature of this discussion group that when I or others put forward a new
idea, some people - not you - suggest that the person with the idea spend
their energies on one of the many problems that need to be solved, yet the
reality is that unless one represents an organization or already has
specific linguistic expertise, such opportunities are not available.

>
>>Development of a character set for Cuneiform tablets.
>
>Some use of PUA characters has I think been used by some font
>developers. But it's for exchange within a very small group of
>investigators.
>

Well, I would be interested to have a look at those founts if that were
possible. Indeed, following the recent posting about cuneiform in this
forum, I have been having a look at various web sites about cuneiform and
the subject is fascinating.

If one first goes to

http://www.jhu.edu/ice/

one can follow a link labelled database in the SEE ALSO: section of that
document, which takes one to

http://www.jhu.edu/ice/database/cuneiformsigns.xml

which contains a link to a project at the University of Birmingham at

http://www.eee.bham.ac.uk/cuneiform/

in England.

Clicking there on the NoFrames link leads to

http://www.eee.bham.ac.uk/cuneiform/menunf.html

and there is lots of interesting information to be found in the links from
that page.

In particular, the link Cuneiform leads to

http://www.eee.bham.ac.uk/cuneiform/cuneiform.html

which page includes a marvellous interactive illustration that uses a
Viewpoint Media Player.

There is a link labelled Instructions which leads to

http://www.eee.bham.ac.uk/cuneiform/instructions.html

and there is a link there for a free download of a Viewpoint Media Player
plug-in.

Going back to

http://www.eee.bham.ac.uk/cuneiform/menunf.html

one can then use the link labelled Results in the 3D Digital Imaging list
which leads to the

http://www.eee.bham.ac.uk/cuneiform/3dresults.html

web page.

Down that page is a link labelled

Click here to interact with a complete 3D scan of the cuneiform tablet!

and that takes one to

http://www.eee.bham.ac.uk/cuneiform/tab1.html

which is a large display using the Viewpoint Media Player.

This is absolutely fantastic and I feel that my knowledge of what can be
done on home computers over the internet took a quantum leap forward with
this demonstration.

I then disconnected the telephone line link to the internet and the display
still worked.

However, if I go back then forward on the browser, the display is lost.

The instructions page

http://www.eee.bham.ac.uk/cuneiform/instructions.html

has a link labelled Viewpoint, with the message "To find out more about VET,
go to Viewpoint.", which opens a new window at

http://www.viewpoint.com/

VET stands for Viewpoint Experience Technology.

There is at that website a DEVELOPER CENTRAL section, and there is a link
there labelled learn the basics where a file getstarted.pdf is available.
This is 792 kilobytes and can be downloaded to local storage for off-line
viewing.

There are also various other items available for download, yet I have not
yet reached that stage.

----

As I read about these ancient clay tablets and the great quantity of them that exist, I began to wonder what information about observations of the night sky in ancient times survives. For example, are there any observations of comets that could be tied in with the object that we now know as Halley's comet? Or, when a new comet appears in the present day sky and its orbit is calculated and it is said to return once in every 3000 years or so, is it possible to find it being observed in ancient times and that observation recorded on clay tablets?

A search at http://www.yahoo.com using an advanced search with an AND on the two words cuneiform and astronomy gave a number of interesting links.

One such is http://www.lexiline.com/lexiline/lexi42.htm which has a picture of a clay tablet with a diagram upon it.

>>The eutocode system, which is part of my own research, which is mentioned in
>>the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) section
>>at http://www.users.globalnet.co.uk/~ngo which is our family webspace in
>>England.
>
>"The unicode system is today a 21 bit system. Details are at the
>http://www.unicode.org website."? Where is a "21-bit system"
>mentioned on the Unicode website?

The quote is from the file http://www.users.globalnet.co.uk/~ngo/ast02900.htm and is at the start of the text of the document, which document is dated Tuesday 12 February 2002.

The answer to your specific question is that I am, after a bit of a search, not able to find "21-bit system" mentioned on the Unicode website.

There is a reference in a FAQ about UTF and BOM to 21 significant bits.

I have learned a lot by looking up that document and by trying to understand quite exactly why my statement should be questioned. Is it that Unicode is not defined as being a system of any particular number of bits, but just as a system that can be represented in various ways using 8 bit bytes, 16 bit words or 32 bit words?

Is what I wrote wrong?

My two sentences were intended to have the following meaning.

The unicode system is today a 21 bit system. Details of unicode, which is today a 21 bit system, are available at the http://www.unicode.org website.

I am fond of precision and try to be precise, so, if my statement is wrong I will happily change it. Yet what should I change it to become? As far as I know, unicode is a 21 bit system. I accept that I cannot find a direct statement to that effect on the Unicode website.

I ask that readers please consider how someone approaching the Unicode website for the first time, wondering what Unicode is, and then finding out that it encodes the characters of the languages of the world and symbols for mathematics and so on, actually finds out that it is a 21 bit system. He or she may have heard of ASCII, know that it is 7 bit or even 8 bit, and wonder how exactly Unicode manages to encode all the characters for all of the languages of the world. Perhaps the Unicode website needs some more direct information straight off, as unfortunately the size of the coding space only gradually gets through. Indeed, perhaps the first paragraph of Chapter 1 of the Unicode specification needs to include a mention of the 21 bit coding space so that the new user of Unicode can immediately get a grasp of what is going on.

There is a note on the page that I am reading, and I note that I saved the file to hard disc on 22 February 2001, that this excerpt from the book, The Unicode Standard, Version 3.0, has been slightly modified for the web, so I do not know what the book says, as the nature of the modifications is not stated on the web page.

Please know that I am not trying to quibble my way out of criticism by in some way seeking to purport that precision is unimportant, for I believe that precision is important and I don't like it when other people try to pretend that precision is unimportant in order to justify imprecision. Precision is the basis of success. Yet is my statement that the unicode system is today a 21 bit system even imprecise, even if it is not wrong? Let us go into this as precisely as we in this discussion group can go.

So, is the unicode system today a 21 bit system?

Is there anyone prepared to state that it is not a 21 bit system, stating reasons for so saying?
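
Purely as a small worked illustration of where the figure of 21 bits comes from, here is a minimal sketch, expressed in the Python programming language for convenience. It assumes only that the Unicode code space runs from U+0000 to U+10FFFF and that the encoding forms move data about in 8 bit, 16 bit or 32 bit units.

# A minimal sketch of the "21 bit" observation discussed above.

MAX_CODE_POINT = 0x10FFFF  # the top of the Unicode code space

# 0x10FFFF in binary is 1 0000 1111 1111 1111 1111, which is 21 bits long.
print(MAX_CODE_POINT.bit_length())        # prints: 21

# The encoding forms serialise that 21 bit value into fixed-size code units.
sample = chr(MAX_CODE_POINT)
print(len(sample.encode("utf-8")))        # 4 bytes of 8 bits each
print(len(sample.encode("utf-16-be")))    # 4 bytes, i.e. two 16 bit code units
print(len(sample.encode("utf-32-be")))    # 4 bytes, i.e. one 32 bit code unit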

Now, thinking about this whole situation, it appears to me that there is scope for considerable future development of my idea that characters from a cuneiform type tray within the Private Use Area and characters from a eutocode type tray within the Private Use Area could be used together within the same plain text file, so as to store both character codes and data about the three-dimensional physical shape of a particular clay tablet which carries cuneiform characters upon it.

It would seem possible that, in time, cuneiform characters will be designated as regular Unicode characters in a plane that is not plane 0. It seems to me that, since unicode uses 21 bits and computer storage media such as hard discs are oriented to 8 bit bytes, there is scope to develop a file coding format that is essentially of a plain text nature in which characters are treated as being 24 bits long, so that each character is stored as 3 bytes. Expressed as six hexadecimal digits, codes whose first hexadecimal digit is 0 or 1 would be reserved for Unicode, those beginning with 2 would be unused, and the rest would be used for inputting and manipulating data. Such a special file format and coding system might be very useful for encoding physical data about cuneiform tablets and Unicode character codes together in one essentially plain text file, the format of which could be open so that everyone could use it without having to buy specific proprietary software to handle that file format.
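
To make the 3 byte idea concrete, here is a minimal sketch in Python of packing and unpacking a 24 bit code as 3 bytes. The choice of big-endian byte order and the function names pack24 and unpack24 are my own assumptions, made only for illustration; the idea itself does not depend upon them.

# A sketch of storing each 24 bit code as 3 bytes.
# Big-endian byte order is assumed here; the proposal above does not specify it.

def pack24(code):
    # code is an integer in the range 000000 to FFFFFF hexadecimal
    assert 0 <= code <= 0xFFFFFF
    return bytes([(code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF])

def unpack24(three_bytes):
    b0, b1, b2 = three_bytes
    return (b0 << 16) | (b1 << 8) | b2

# A code in the Unicode range and a code from the data area both round-trip.
print(hex(unpack24(pack24(0x0101D1))))   # prints: 0x101d1
print(hex(unpack24(pack24(0x600110))))   # prints: 0x600110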

I have it in mind that the various sections would be as follows.

000000 to 10FFFF Unicode
110000 to 2FFFFF reserved
300000 to 3FFFFD control, though only a few of these code points are going to be used.
400000 to 4FFFFD obey the current x0p process, load 18 bits of data into register x0, obey the current x0q process.
500000 to 5FFFFD obey the current y0p process, load 18 bits of data into register y0, obey the current y0q process.
600000 to 6FFFFD obey the current z0p process, load 18 bits of data into register z0, obey the current z0q process.
700000 to 7FFFFD obey the current t0p process, load 18 bits of data into register t0, obey the current t0q process.
800000 to 8FFFFD obey the current x0a process, load 18 bits of data into register x0, obey the current x0b process.
900000 to 9FFFFD obey the current y0a process, load 18 bits of data into register y0, obey the current y0b process.
A00000 to AFFFFD obey the current z0a process, load 18 bits of data into register z0, obey the current z0b process.
B00000 to BFFFFD obey the current t0a process, load 18 bits of data into register t0, obey the current t0b process.
C00000 to CFFFFD obey the current x1a process, load 18 bits of data into register x1, obey the current x1b process.
D00000 to DFFFFD obey the current y1a process, load 18 bits of data into register y1, obey the current y1b process.
E00000 to EFFFFD obey the current z1a process, load 18 bits of data into register z1, obey the current z1b process.
F00000 to FFFFFD obey the current t1a process, load 18 bits of data into register t1, obey the current t1b process.

The codes from 400000 through to FFFFFD are such that the two least significant bits are always set to 00, so as to avoid any problems with any ending FE or FF being encountered whilst still having total coverage of the data set.

The 24 processes x0p, x0q and so on all have default actions. There is a choice of processes available, which can be set using 24-bit codes that begin with hexadecimal 3. For example, defaults might include actions such as do nothing, move pen, draw line, and move data to oldx, oldy, oldz and oldt registers, so as to produce a vector graphics drawing system with a data resolution of 36 bits in the x, y, z and t axes. Data storage needing 18 or fewer bits of resolution would not use the codes from C00000 through to FFFFFD at all.

There could be processes that could be chosen that do such things as a choice for z0p that will autoincrement x by one step, so that a sequence of 24 bit codes could be used to encode a sequence of values of z, with x automatically incrementing by one step for each value of z that is supplied. This would mean, for example, that physical scan data for the heights of clay in a clay tablet along a scan line could be encoded in a fairly packed manner, yet the file coding would have the flexibility to include tags that link specific Unicode cuneiform characters to specific areas of the clay tablet.

Please know that this is suggested as a general overview. A lot of research needs to be done in order to devise a coding system that would be useful in practice, rather than simply the interesting speculation on the possibilities that is presented here.
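
As one possible reading of the register loading scheme described above, here is a speculative sketch in Python. It assumes that within a data code the top four bits select the register and its pair of processes, the next eighteen bits are the data and the lowest two bits are padding; only the x0p and x0q style families from 400000 to 7FFFFD are sketched, and all of the function and variable names are my own, chosen only for illustration.

# A speculative sketch of decoding some of the data codes listed above.
# Assumptions: the top 4 bits of a code select a register and its process pair,
# the next 18 bits carry the data, and the lowest 2 bits are padding. Only the
# p and q style families (400000 to 7FFFFD) are sketched; the a and b families
# and the x1, y1, z1, t1 registers would be handled in the same way.

registers = {"x0": 0, "y0": 0, "z0": 0, "t0": 0}

FAMILY = {0x4: "x0", 0x5: "y0", 0x6: "z0", 0x7: "t0"}

def do_nothing(regs):
    pass

def autoincrement_x0(regs):
    # An example default process for z0p: step x0 for each new z0 value, so that
    # a line scan of clay heights can be stored as a bare run of z0 codes.
    regs["x0"] += 1

# The currently selected p and q processes for each register; codes in the
# 300000 to 3FFFFD control range would change these selections (not sketched).
p_process = {"x0": do_nothing, "y0": do_nothing, "z0": autoincrement_x0, "t0": do_nothing}
q_process = {"x0": do_nothing, "y0": do_nothing, "z0": do_nothing, "t0": do_nothing}

def handle(code):
    top = code >> 20
    if top in FAMILY:
        name = FAMILY[top]
        p_process[name](registers)                 # obey the current p process
        registers[name] = (code >> 2) & 0x3FFFF    # load 18 bits of data
        q_process[name](registers)                 # obey the current q process
    # Codes from 000000 to 10FFFF would simply pass through as Unicode characters.

# A bare run of three z0 height values; x0 steps forward automatically for each.
for code in (0x600100, 0x600104, 0x600110):
    handle(code)
print(registers["x0"], registers["z0"])   # prints: 3 68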

Yet I feel that the possibility of using Unicode characters in conjunction with graphic data, with everything encoded together in an open format file, is an important one for the future.

>
>>Within my classification system, suppose please that someone developing the
>>character codes for Egyptian hieroglyphics requests
>
>Of whom? Of the maintainers of the "type tray" maintainers (an
>analogue to John Cowan and me, for ConScript)? But why? ConScript has
>a number of fun scripts in it and people might be interested in
>encoding or exchanging more than one script, that's why there's a
>central registry. But it seems to me that people exploring the
>encoding of Egyptian and Cuneiform won't be worried about an overlap
>-- apart from me, and if asked I'd just get my fellows to assign two
>separate blocks to do it just in case there were a problem. For those
>scripts, though (or for Blissymbols, another candidate for PUA test
>implementation), the intention would be to use PUA as a very
>temporary stopgap for testing.

Well, it is, I feel, one of the difficulties of being an inventor, and a matter which needs great tact and diplomacy, that when putting forward a new idea that one wishes to become accepted, one must balance carefully between offering leadership and acting in a self-centred manner. Although I am happy to be the person of whom the requests are made if that helps get the idea into use, I am reluctant to suggest that it be me in case it looks as if I am on an empire building trip, yet I am also reluctant to suggest that the requests be considered by some other person or committee lest it be thought that I am trying to get someone else to do the work rather than carry it out myself. Sigh.

I had in mind a sort of informal arrangement whereby someone requesting a type tray designation, something which would probably happen only infrequently, would post the request in this unicode@unicode.org discussion forum, and any one of a number of people interested in the classification system could offer advice if a clash with some existing use would occur, or if the suggested type tray code would be in the wrong part of the classification; after any discussion had settled, within a few days, the requester would typically have his or her request granted. I got the idea for this system as a result of observing the newsgroup alt.config in action in defining the alt.* part of the newsgroup hierarchy. It sounds unstructured, yet it works very efficiently in practice, and people wishing to start new newsgroups are encouraged and helped to get their new newsgroup started.

When I first started researching eutocode, I took great care to have a look at the ConScript registry, specifically to try to avoid clashes in code point allocation between eutocode and the ConScript registry. I am aware that I was under no legalistic obligation to do so, yet I felt that I needed to do so in order to gain knowledge of the state of the art of use of the Private Use Area. As it happens, eutocode overlaps only a little with the ConScript registry allocations of code points. Eutocode was, in fact, placed in the upper part of the U+E... section so as to be in the middle of the Private Use Area and thereby minimize clashes with other uses of the Private Use Area as far as possible, and I was pleased that I had, at that time, avoided overlap with allocations in the ConScript registry. There is now, unfortunately, an overlap of eutocode with the ConScript Registry around U+E800.

Certainly, if you so wish, and I accept that you may not so wish, the ConScript registry could have a type tray code within the classification system. Perhaps I might also be allowed to request that the existence of the classification system codes be noted in the ConScript registry, in the sense that you might choose to avoid placing anything in the U+F3.. block.

In relation to the Private Use Area as mentioned in the Unicode specification, I feel that it would be helpful if there were a note of guidance that the area U+F000 to U+F0FF is used for the symbol founts. I was aware that there was such an area and was trying to find out where it was located. Now, I appreciate that such a note might be seen as an endorsement of specific allocations, yet the other way of looking at it is that someone could be trying to work within the Unicode specification rules, doing his or her best to do so, and yet a very relevant piece of information like that is not easily available, perhaps because placing such a note in the specification would be seen as endorsement. Please consider: apart from picking the information up serendipitously by happening to read this posting, or some other posting that happens to refer to it, how exactly is a person learning to use the Unicode system and to apply it as an end user supposed to find out such information, which, after all, is quite important when it comes to allocating codes within the Private Use Area?

Maybe some sort of all but tacit understanding that, say, Egyptian hieroglyphics and cuneiform development within the Private Use Area do not overlap might be achieved. Yet such an arrangement would necessarily cut down the number of code points available for each. Also, if eventually some other learned group tried to "fit in" and overlap neither Egyptian hieroglyphics nor cuneiform in the Private Use Area, it might become impossible to do. Certainly, some scripts might have been promoted out of the Private Use Area by then, yet I feel that even where there is promotion to permanent places in regular Unicode, there are three problems that need to be addressed. The first is that there may be works in existence that will not get updated - for example, a student project done with what was available at the time when the fount was in the Private Use Area. The second is that the Private Use Area founts don't get changed on all of the PCs: obsolescent encodings still get produced. The third is that some systems will not handle 21 bit unicode and so the old Private Use Area founts will remain in use. Someone in this discussion forum referred some time ago to the great tsunami (tidal wave) whereby companies assume that everyone has converted to the latest version of their hardware and software. This is not always the actual situation: old PCs go on for years in college departments, and a new PC often means that the total number available increases, not that any machine is actually discarded.

Thank you for mentioning Bliss.

>
>>that he or she be assigned type tray 3001 and that someone
>>developing character codes for cuneiform requests the assignment of
>>type tray C001 and that I request type tray E001 for the eutocode
>>system, and that all of these requests are granted.
>
>That's more or less how ConScript functions.
>
>>Then, in order to apply the classification system to any plain text file,
>>the file needs to contain some classification characters near the start.
>>
>>For a file using the Egyptian hieroglyphics characters, the following
>>sequence would be needed.
>>
>>U+F35B U+F333 U+F330 U+F330 U+F331 U+F35D
>
>I don't understand this. Just assign Egyptian and Cuneiform to two
>separate areas.

No, it is important that the individual or group designating codes for Egyptian and cuneiform have a good number of code points in order to have scope for their work. My suggested classification system means that they need only avoid the U+F3.. block for their designations if they choose to use it; also, my suggestion can be applied retrospectively if someone has already prepared documents and founts, as long as there is no clash with the U+F3.. block.
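
To show how a program might recognise the classification characters near the start of a file, here is a hedged sketch in Python. It assumes, from the example sequence for type tray 3001 quoted above, that U+F35B and U+F35D act as opening and closing marks and that U+F330 to U+F33F stand for the hexadecimal digits 0 to F; that reading of the sequence is my own inference, made only for illustration.

# A hedged sketch of reading a type tray marker near the start of a plain text
# file. It assumes that U+F35B and U+F35D are opening and closing marks and
# that U+F330 to U+F33F stand for the hexadecimal digits 0 to F; that digit
# mapping is my own inference from the quoted example for type tray 3001.

OPEN_MARK = "\uF35B"
CLOSE_MARK = "\uF35D"

def read_type_tray(text):
    """Return the type tray code (e.g. '3001') if a marker is present, else None."""
    start = text.find(OPEN_MARK)
    if start == -1:
        return None
    end = text.find(CLOSE_MARK, start + 1)
    if end == -1:
        return None
    digits = []
    for ch in text[start + 1:end]:
        value = ord(ch) - 0xF330
        if not 0 <= value <= 0xF:
            return None                      # not a classification digit
        digits.append("0123456789ABCDEF"[value])
    return "".join(digits)

# The example sequence for Egyptian hieroglyphics quoted above.
sample = "\uF35B\uF333\uF330\uF330\uF331\uF35D" + "some text using PUA characters"
print(read_type_tray(sample))   # prints: 3001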

>
>>Suppose then that one day someone comes across a plain text file and within
>>that plain text file are character codes from the Private Use Area and that
>>person has no idea as to which character set those character codes may be
>>intended to represent.
>
>The person wouldn't, because PUA values are agreed between sender and receiver.

Or published.

Yet what if that person is a researcher in a department of ancient languages in a university and there are all of these files on the hard disc of a computer, produced two years ago by a student who has since left?

>I think people working with Egyptian, Cuneiform, or Blissymbols will
>use their own fonts for private research and can't imagine how a
>central clearing house would benefit them.

It is not a central clearing house. It is just a list of type trays. A person or group that requests a type tray simply has, within the classification system, that type tray designation: designations of individual code points within the type tray are not noted in the list. This would be radically different from the ConScript registry, which provides lots of interesting information about the various scripts.

William Overington

29 April 2002


