RE: [Proposal] Extended UTF-16 by using Plane 14

From: Christian Wittern (chris@ccbs.ntu.edu.tw)
Date: Tue Apr 13 1999 - 23:40:38 EDT


Geoffrey wrote:

> I understand the problem quite well. They want to adapt Unicode to encode
> a large swath of non-Unicode type data and are now asking the rest of the
> world to modify their software so that this non-Unicode data will be
> processed in some fashion correctly.

Hmm, err, this is not exactly the case. Two things are converging here.
One is the need to deal with characters not currently in any standard.
Measured against the current Han character set of Unicode, these amount to
less than 0.1% of most texts, so I don't think they need to be encoded in a
standard at this point. They do, however, need to be represented in the
texts. Also, to avoid loss of information, the texts are usually input with
the variants actually used in the printed versions, while for publication,
depending on the needs of the intended user community, a process of
normalization and unification is applied.
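
To give a concrete, though entirely hypothetical, picture of that last
step, the kind of table-driven substitution I have in mind looks roughly
like the sketch below; the variant pairs are only illustrative, not
anybody's actual unification rules.

# Minimal sketch of the normalization/unification step described above.
# The variant -> normalized table is purely illustrative; a real project
# would maintain its own tables for its intended user community.
VARIANT_TO_NORMALIZED = {
    "\u8218": "\u9928",   # illustrative pair: a variant form and its usual form
    "\u9AD4": "\u4F53",   # another illustrative pair
}

def normalize(text, table=VARIANT_TO_NORMALIZED):
    """Replace printed variant forms by their normalized counterparts."""
    return "".join(table.get(ch, ch) for ch in text)

print(normalize("\u8218\u9AD4"))   # hypothetical usage
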
The other thing is that a group in Japan has spent the last ten years
collecting a large number of Han characters. They have created a set of
TrueType fonts for more than 90,000 characters, which is available for free
download from their website (http://www.mojikyo.gr.jp). This collection
contains the Unicode Han area as one of its many subsets. We now find it
convenient to use Unicode as the base character set and to reference
characters from this collection as necessary. This provides a way to deal
with these characters before they are encoded by the standards bodies (and
will in fact provide information on usage and frequency, which will help
the standards bodies decide which characters to put into the appropriate
slots; in this sense it is of course a temporary solution). Now, although
we don't want to use the characters from this collection that are already
in Unicode, it still seems easiest and best to make the whole collection
available to users wholesale. This is what Mr. Maedera is trying to add to
his Unicode Editor.
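
How such references might look in the data is of course up to the software;
purely as an illustration (and assuming a made-up inline notation of the
form &M012345; for a Mojikyo serial number, which is not anything the
editor actually prescribes), one could separate the Unicode base text from
the referenced characters roughly like this:

import re

# Hypothetical notation: "&M012345;" marks a character taken from the
# Mojikyo collection by its serial number; everything else is ordinary
# Unicode text. This only sketches the idea of using Unicode as the base
# set and referencing outside characters as needed, not a real file format.
MOJIKYO_REF = re.compile(r"&M(\d{6});")

def split_references(text):
    """Yield ("unicode", substring) and ("mojikyo", number) pieces in order."""
    pos = 0
    for m in MOJIKYO_REF.finditer(text):
        if m.start() > pos:
            yield ("unicode", text[pos:m.start()])
        yield ("mojikyo", int(m.group(1)))
        pos = m.end()
    if pos < len(text):
        yield ("unicode", text[pos:])

# Example: a sentence that is ordinary Unicode except for one referenced glyph.
for kind, value in split_references("是諸法空相&M012345;不生不滅"):
    print(kind, value)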

> There are already 2 encodings for
> ISO-10646 which will allow them to store huge quantities of non-Unicode
> data compatibly. Given the implication that they would not be using the
> existing CJK in the BMP, I would think that both UTF-8 and UCS-4 are more
> space and processing efficient than a stream of what would be mostly
> 12 octet sequences in their data. UTF-16 was designed for something
> different from what they are trying to accomplish.

As explained above, your implication is wrong.
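
Just to put rough numbers on the space question anyway: the sketch below
takes the 12 octets per extended character from the quoted paragraph at
face value, assumes the rest of the text is ordinary BMP Han (3 octets in
UTF-8, 2 in UTF-16), uses the 6-octet worst case of RFC 2279 UTF-8 for the
characters outside Unicode, and picks 10,000 characters with the 0.1% share
mentioned above as a round example.

# Back-of-the-envelope size comparison under the assumptions stated above.
TOTAL_CHARS = 10_000
RARE_CHARS = int(TOTAL_CHARS * 0.001)     # characters not (yet) in Unicode
COMMON_CHARS = TOTAL_CHARS - RARE_CHARS   # ordinary BMP Han characters

sizes = {
    # UCS-4: fixed 4 octets per character, rare or not.
    "UCS-4": 4 * TOTAL_CHARS,
    # UTF-8 (RFC 2279): 3 octets for BMP Han, up to 6 octets elsewhere.
    "UTF-8": 3 * COMMON_CHARS + 6 * RARE_CHARS,
    # UTF-16 plus the Plane 14 extension: 2 octets for BMP Han and the
    # quoted 12 octets for each extended-sequence character.
    "extended UTF-16": 2 * COMMON_CHARS + 12 * RARE_CHARS,
}

for name, octets in sizes.items():
    print(f"{name:>16}: {octets} octets")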

Christian Wittern


