Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Tue Feb 20 2001 - 12:48:08 EST


The following statements have been made by participants in this thread.

1.

A few days ago I said there was a "widespread belief" that Unicode is a
16-bit-only character set that ends at U+FFFF. A corollary is that the
supplementary characters ranging from U+10000 to U+10FFFF are either
little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.

2.

Can we put this thread on a constructive footing? I am sure there is
lots of outdated and/or incorrect information out there and I would
like to preempt its being identified via numerous emails here.
If the belief is there are misperceptions that need to be corrected, how
should the problem be addressed? Bear in mind the volunteer nature of the
organization....

----

I wonder if some readers might like to have a look at a specific situation. This would certainly help me and might also provide a useful case study on the practical problems.

I do not purport to be an expert in Unicode. Unicode is but one of many interests. I do recognize that Unicode is attempting to be a comprehensive standard system and I would like to do what I can within my own research to utilize it.

As some readers may remember, I am producing a computer language called 1456 object code (in speech, "fourteen fifty-six object code"). It is expressible using 7-bit ASCII printing characters and may be included in the param statements of an applet call in an HTML page. The applet then calls a Java class file named Engine1456.class, and quite substantial computations with graphic output may be achieved using a combination of ready-prepared standardized Java classes and programs written in 1456 object code using a text editor. The benefit is that people who either do not know Java or do not have Java compiling facilities available may reasonably straightforwardly produce, using just a text editor such as Notepad, quite elegant graphics programs with Java quality graphics. There is a speed overhead, but, even for fast running programs, a 1456 object code program can reach up to about 40% of the speed of a specially written Java program. With programs that wait for user input, the difference in speed may not be noticeable.

The system is fully described on www.users.globalnet.co.uk/~ngo which is our family webspace in England and readers are welcome to study it in full if they so wish, yet only a few documents need to be studied, and then only in part, for the purposes of this case study.

The 1456 object code system relies for its underlying standardization on the fact that the software that interprets the 1456 object code (that is, the 1456 engine) is written in Java. Therefore 1456 object code is immediately usable with a standard Java-enabled browser on the internet and also on the JavaTV system as telesoftware. As JavaTV may well become a worldwide broadcasting standard, there is practical importance in 1456 object code being fully able to handle character strings in all languages that are encoded in Unicode.

Characters are introduced into the 1456 object code system in the document

www.users.globalnet.co.uk/~ngo/14560600.htm

where 1456 object code characters are said to be "represented using the 16 bit unicode characters of Java."

Various registers are explained there. The key point for this discussion, though, is that one may load a character from the software into a register, as a sort of "load immediate" instruction, in two ways.

A 7-bit ASCII printing character may be loaded using a two-character sequence consisting of the ^ character followed by the desired character. For example, ^E can be used to encode the character U+0045 in the software.

Any 16-bit Unicode character may be loaded by a six-character sequence consisting of 'u and four hexadecimal characters. So, the character U+0045 could be loaded using 'u0045 in the software.

Clearly, the six-character method can be used for more characters than the two-character method, as the two-character method can only be used for the characters that can be entered as 7-bit ASCII printing characters from the keyboard when programming.

Please note that when the 1456 object code is being obeyed, the character that follows the ^ character already exists as a 16-bit Java Unicode character within the software, the conversion from 7-bit ASCII to 16-bit Unicode having taken place when it was loaded into the applet from the param statement of the applet call.
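
As a rough illustration of the two load forms described above, here is a minimal Java sketch. It is not taken from Engine1456; the class name LoadImmediateSketch and the method decode are hypothetical, and validation is limited to what Integer.parseInt provides.

```java
// Hypothetical helper, not part of Engine1456: decodes one "load immediate"
// operand that starts at position pos in the program text.
final class LoadImmediateSketch {

    // Returns the 16-bit character encoded at pos, using either the
    // two-character form ^X or the six-character form 'uXXXX.
    static char decode(String program, int pos) {
        char lead = program.charAt(pos);
        if (lead == '^') {
            // Two-character form: the next character is taken literally.
            return program.charAt(pos + 1);
        }
        if (lead == '\'' && program.charAt(pos + 1) == 'u') {
            // Six-character form: 'u followed by four hexadecimal digits.
            String hex = program.substring(pos + 2, pos + 6);
            return (char) Integer.parseInt(hex, 16);
        }
        throw new IllegalArgumentException("not a load-immediate form at " + pos);
    }

    public static void main(String[] args) {
        // Both forms should yield the same character, U+0045.
        System.out.println(decode("^E", 0) == decode("'u0045", 0));
    }
}
```

Both forms end up as the same Java char, which matches the point that the ^ form has already become a 16-bit character by the time the code is obeyed.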

The page

www.users.globalnet.co.uk/~ngo/14560700.htm

shows how the six-character method using 'u may also be used when entering strings of characters.

The next page that is needed for this case study is

www.users.globalnet.co.uk/~ngo/14561100.htm

and within that page the demo2.htm example.

Within the source code of the demo2.htm file there are the following uses of the six-character method.

'u00e9

'u0108

'u011d

For example, the sequence

[ Caf'u00e9]

is used to load the four-character string Café from the software, with an acute accent on the e.

After that, the 'u method is used wherever it is needed to produce the desired effects. It proved very useful when writing the software that produced the diagram used in the document
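
The expansion of 'u sequences inside a string might be sketched in Java as follows. This is only an illustration: the class StringEscapeSketch and the method expand are hypothetical names, and the handling of the surrounding square brackets is deliberately left out because their exact rules are defined in the 1456 documents rather than here.

```java
// Hypothetical helper, not part of Engine1456: expands 'uXXXX sequences
// found inside the body of a string.
final class StringEscapeSketch {

    // Copies ordinary characters unchanged and replaces each six-character
    // 'uXXXX sequence with the single 16-bit character it names.
    static String expand(String body) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            boolean escape = body.startsWith("'u", i) && i + 6 <= body.length();
            if (escape) {
                String hex = body.substring(i + 2, i + 6);
                out.append((char) Integer.parseInt(hex, 16));
                i += 6;
            } else {
                out.append(body.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Nine characters of source text become a four-character string.
        System.out.println(expand("Caf'u00e9").length());
    }
}
```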

www.users.globalnet.co.uk/~ngo/14563100.htm

later in the sequence. The diagram is near the end of the document.

In that software, the characters

'u03b1

'u03b2

'u03b3

'u03be

were used.

The fonts that I have used are from Microsoft as mentioned in the document

www.users.globalnet.co.uk/~ngo/14561100.htm

mentioned previously. There are about 600 characters available, which is far fewer than the 65,536 code values that the 'u command could address. They include Latin, Greek and Cyrillic characters, and more.

Having set the scene of how I apply Unicode to my own application at present, the question arises as to how to proceed to use the full Unicode system.

I am quite happy to designate 'v, followed by however many characters are judged necessary, as the way to load a Unicode character of whatever width is needed into a register from the software. Perhaps that is 'v followed by eight hexadecimal characters, or perhaps 'v followed by six. I can use both 'V and 'v without any problem if that is what is needed.
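
One possible shape for such a 'v command is sketched below in Java, purely as an assumption for discussion. It takes six hexadecimal digits, which is enough for the largest Unicode code point, U+10FFFF; the choice of six rather than eight, the class name WideLoadSketch and the method decodeWide are all hypothetical.

```java
// Hypothetical sketch of a 'v command; nothing here is part of Engine1456.
final class WideLoadSketch {

    // Assumption: 'v is followed by exactly six hexadecimal digits.
    static final int HEX_DIGITS = 6;

    // Returns the Unicode code point named by the 'v sequence at pos.
    static int decodeWide(String program, int pos) {
        if (!program.startsWith("'v", pos)) {
            throw new IllegalArgumentException("not a 'v form at " + pos);
        }
        String hex = program.substring(pos + 2, pos + 2 + HEX_DIGITS);
        int codePoint = Integer.parseInt(hex, 16);
        // Six hex digits can express values above U+10FFFF, so check the range.
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("beyond U+10FFFF: " + hex);
        }
        // U+D800 to U+DFFF are surrogate code values, not characters.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("surrogate value: " + hex);
        }
        return codePoint;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decodeWide("'v010000", 0)));
    }
}
```

The result is held in an int rather than a char, since a single 16-bit Java char cannot hold values above U+FFFF.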

Yet two further matters arise.

1. What about the fact that Java uses 16-bit characters?

2. Even if I encode the extra characters using some scheme involving 'v and maybe 'V commands followed by however many hexadecimal characters, and store them in the software, how am I supposed to display them on the screen? Are these characters available in font files? Suppose that an application uses only, say, ten of these extra characters out of the large number of codes available, much as the fonts that I am using have characters for only about 600 of the 65,536 possible codes. Can an ordinary font file be used to encode these ten characters with the large code numbers? I would quite like to have a go at encoding the 'v and maybe 'V commands in a reasonable manner and trying them out with real data for real characters.
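
On the first question, Unicode represents each character from U+10000 to U+10FFFF in 16-bit units as a surrogate pair: a high surrogate in the range U+D800 to U+DBFF followed by a low surrogate in the range U+DC00 to U+DFFF. The Java sketch below shows that arithmetic and compares it with Character.toChars, a library method added in Java 5 and therefore not available in earlier Java versions. The class name SurrogateSketch and the method manualPair are hypothetical, and the sketch does not address the font question.

```java
// Hypothetical sketch: storing a supplementary character in 16-bit Java chars.
final class SurrogateSketch {

    // Splits a code point in the range U+10000 to U+10FFFF into its
    // high and low surrogates using the standard UTF-16 arithmetic.
    static char[] manualPair(int codePoint) {
        int offset = codePoint - 0x10000;                // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x10000;
        char[] byHand = manualPair(codePoint);
        char[] byLibrary = Character.toChars(codePoint); // Java 5 and later
        System.out.println(java.util.Arrays.equals(byHand, byLibrary));

        // The resulting string occupies two chars but holds one code point.
        String s = new String(byLibrary);
        System.out.println(s.length() + " " + s.codePointCount(0, s.length()));
    }
}
```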

I have tried in this posting, with reference to just a few web pages, to provide sufficient detail of the practical problem that I face in relation to the matters raised in this thread. I wonder if the people who are specialists in Unicode might like, in their resolution of this thread, to prepare a document for someone who is not a Unicode specialist yet is trying to apply Unicode to a real project where the Unicode aspect is but one part. Such a person should be able to find there an explanation of the Unicode system sufficient to understand the underlying structure, program it into software, and apply it correctly using font files. Such a document would be very helpful. If it already exists, I would be pleased to know of a reference to it.

William Overington

20 February 2001


