Re: UTF8 vs. Unicode (UTF16) in code

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Fri Mar 16 2001 - 05:38:57 EST


I should say at the outset that I am a relative novice in matters of
unicode and have no knowledge of the details of the other encodings, so I am
unable to understand the part of a post that is copied below.

I am looking at the possibility of having hypercode, ranging from H+110000
to H+3FFFFFFF, that is, all of the 30 bit integers that are not included in
unicode.

Access to these characters would be by a sequence of six characters from the
private use area of unicode.

Given a uniengine with registers H2, H1, H0 as well as the accumulator A and
other uniengine registers, the sequence would be as follows.

----

A=(load 10 bits of data using a code in the range U+EC00 to U+EFFF)

H2=A, using U+EBE2.

A=(load 10 bits of data using a code in the range U+EC00 to U+EFFF)

H1=A, using U+EBE1.

A=(load 10 bits of data using a code in the range U+EC00 to U+EFFF)

H0=A and then draw the character given by (1048576*H2) + (1024*H1) + H0, by passing the value to the rendering engine, using U+EBE9.

----

The U+EBE9 is a two code ligature of U+EBE0 and U+EBE8.

Instead of using U+EBE9 as above, the code U+EBEE may be used (U+EBEE being a two code ligature of U+EBE0 and U+EBED), to produce the effect of H0=A and then draw the character given by (1048576*H2) + (1024*H1) + H0, by obeying the uniengine software stored for that character.

That is, U+EBE9 would obtain the details of the shape of the character to be drawn from a font file, the U+EBEE would obtain the details of the shape of the character to be drawn from a sequence of uniengine drawing codes stored as a macro for that hypercode code, the definition of such macro either being earlier in the same document or stored in a hypercode font file.
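
As a rough illustration of the six character sequence described above, here is a minimal Python sketch. The code points and the formula are taken from the description; the function and constant names are hypothetical, and the sketch only packs and unpacks the sequence. It does not attempt any rendering or macro handling.

```python
# Sketch only: code point roles follow the description above;
# all names here are hypothetical, chosen for illustration.
LOAD_BASE = 0xEC00        # U+EC00..U+EFFF each load 10 bits of data into A
SET_H2 = 0xEBE2           # H2 = A
SET_H1 = 0xEBE1           # H1 = A
DRAW_FROM_FONT = 0xEBE9   # H0 = A, then draw using a font file
DRAW_FROM_MACRO = 0xEBEE  # H0 = A, then draw using stored uniengine codes
FINAL_CODES = (DRAW_FROM_FONT, DRAW_FROM_MACRO)

HYPER_MIN = 0x110000
HYPER_MAX = 0x3FFFFFFF


def encode_hypercode(value, final=DRAW_FROM_FONT):
    """Return the six private use characters that express `value`."""
    if not HYPER_MIN <= value <= HYPER_MAX:
        raise ValueError("value is outside H+110000..H+3FFFFFFF")
    if final not in FINAL_CODES:
        raise ValueError("final code must be U+EBE9 or U+EBEE")
    h2 = (value >> 20) & 0x3FF  # top 10 bits
    h1 = (value >> 10) & 0x3FF  # middle 10 bits
    h0 = value & 0x3FF          # bottom 10 bits
    codes = (LOAD_BASE + h2, SET_H2,
             LOAD_BASE + h1, SET_H1,
             LOAD_BASE + h0, final)
    return "".join(chr(c) for c in codes)


def decode_hypercode(seq):
    """Return (value, final_code) for a six character sequence."""
    if len(seq) != 6:
        raise ValueError("expected exactly six characters")
    cps = [ord(ch) for ch in seq]
    loads, ops = cps[0::2], cps[1::2]
    if ops[0] != SET_H2 or ops[1] != SET_H1 or ops[2] not in FINAL_CODES:
        raise ValueError("unexpected register codes")
    if any(not LOAD_BASE <= c <= LOAD_BASE + 0x3FF for c in loads):
        raise ValueError("data codes must be in U+EC00..U+EFFF")
    h2, h1, h0 = (c - LOAD_BASE for c in loads)
    return (1048576 * h2) + (1024 * h1) + h0, ops[2]
```

For example, encode_hypercode(0x110000) gives the code points U+EC01, U+EBE2, U+EC40, U+EBE1, U+EC00, U+EBE9, and decode_hypercode turns that sequence back into 0x110000 together with the final code.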

----

My thinking is that it might be very useful to be able to use hypercode space as a way of uniquely defining characters that presently are ambiguous within the private use areas and also to have available code space where people may carry out activity that may not be encoded in unicode and may be outside of the (present) terms of reference for unicode.

For example, there is the possibility of assigning characters specifically to implement the moveable music type characters of the early letter press printers.

I have also seen mention of present day artistic activities in the unicode list: one of a gentleman now living in the USA who has drawn 4000 original characters, and another of a project called New English Calligraphy.

----

Yet the posting appended would seem to imply that at least one standard has reserved some (all?) of these codes either as "never to be used" codes or as "might someday be used" codes.

So, my question is this. Which of the integer values from 0x110000 to 0x3FFFFFFF are reserved and by which standards please?

William Overington

16 March 2001

>> ISO 10646 can encode characters beyond UTF-16, and should retain
>> this capability.
>
>This is technically correct. The wording in the standard states
>"Planes 00 to FF in Groups 01 to 5F are reserved for future
>standardization, and thus those code positions shall not be used
>for any other purpose." That is a way of saying that SC2 *could*
>encode characters there at some unspecified time in the future.
>
>However, the proposal that someone mentioned on this thread can
>be seen in Item 3 of the PDAM 1 to 10646-1, currently under ballot,
>which removes user Planes E0 .. FF and user Groups 60 - 7F, placing
>all those code positions into the same reserved status, and
>disallowing their use as private use codes.
>
>The *purpose* of that proposal is to restrict the committed encoding
>space of 10646 to U+0000..U+10FFFF, so that UTF-16 and UTF-8 (and
>UTF-32) are interoperable.
>
>Furthermore, SC2/WG2 is on record, in its minutes, resolutions,
>and principles and procedures as not intending to encode anything
>past U+10FFFF -- precisely because to do so would break interoperability
>between UTF-16 and UTF-8.
>
>> There is a proposal to restrict UTF-8 to
>> only encompas the same values as UTF-16,
>
>Actually, that is a separate proposal that has not yet been
>floated, which would drop the 5- and 6-byte ranges of UTF-8,
>since they are not necessary for UTF-16 interoperability.
>
>> but UCS-4 still encodes
>> the 31-bit code space.
>
>Architecturally, this is still correct. 10646 structures the
>codespace as 128 groups of 256 planes each, and the "Four-octet
>canonical form" (UCS-4) requires the use of 4 "octets". So
>this is a 31-bit code space.
>
>Practically, however, the impact of other restrictions, and
>the requirement for interoperability of UTF-16 and UTF-8 (and
>UTF-32), plus the SC2/WG2 principles and procedures, means
>that the G-octet will always be 0x00, and the P-octet will
>always be in the range 0x00..0x10. In other words, 10646 as
>a *Coded Character Set* (as opposed to an architecture for
>encoding) has a 21-bit code space. And SC2/WG2 is perfectly
>aware that it would be highly inadvisable (and damaging to
>its own successful standard) to exceed that limit.
>
>--Ken
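
The group and plane arithmetic in the quoted reply can be sketched in Python as follows. This is only an illustration of what the quoted text states: the function names and the status strings are hypothetical, the row and cell octets are included only to complete the four-octet split, and any range the quoted text does not mention is reported as such instead of being guessed.

```python
def split_ucs4(value):
    """Split a 31-bit UCS-4 value into (group, plane, row, cell) octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values are limited to 31 bits")
    group = (value >> 24) & 0x7F  # G-octet: 128 groups
    plane = (value >> 16) & 0xFF  # P-octet: 256 planes per group
    row = (value >> 8) & 0xFF
    cell = value & 0xFF
    return group, plane, row, cell


def status_per_quoted_text(value):
    """Classify a value using only the ranges named in the quoted reply."""
    group, plane, _, _ = split_ucs4(value)
    if group == 0x00 and plane <= 0x10:
        return "within U+0000..U+10FFFF"
    if 0x01 <= group <= 0x5F:
        return "reserved for future standardization"
    if 0x60 <= group <= 0x7F or (group == 0x00 and plane >= 0xE0):
        return "private use, proposed to become reserved"
    # Group 00, planes 11..DF are not mentioned in the quoted text.
    return "not covered by the quoted text"
```

Applied to the range asked about, 0x110000 falls in group 00, plane 11, which the quoted text does not mention, while 0x3FFFFFFF falls in group 3F and so is in the range described as reserved for future standardization.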


