RE: UTF8 vs. Unicode (UTF16) in code

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Fri Mar 09 2001 - 11:19:10 EST


On Fri, 9 Mar 2001, Marco Cimarosti wrote:

> Addison P. Phillips wrote:
> > [...]
> > currently there are no characters "up there" this isn't a really big
> > deal. Shortly, when Unicode 3.1 is official, there will be 40K or so
> > characters in the supplemental planes... but they'll be
> > relatively rare.
>
> This reminds me of a question that I wanted to ask since a lot time: how
> rare is the most common of characters in the extended planes? Hmmmm... Maybe
> I should be clearer.
> Does it exist at least one character > U+FFFF that is commonly used in at
> least one modern language?

How about music and math notation?

But, yes. U+21075,[1] gan, is an aspect marker in Cantonese that, when
placed after a verb, denotes continuing action (roughly equivalent to
<-ing> in English). I don't think anyone would dispute the
indispensability or high frequency of this character.

[1] It looks like a left-to-right horizontal arrangement of U+53E3 U+7DCA.

There is pre-existing data with that character, such as:

HKSCS (Hong Kong Supplementary Character Set) has it at 0x9E44, as does
its predecessor GCCS (Government Chinese Character Set). One can buy
Chinese handwriting recognition and OCR software that support at least
GCCS.

Vendor extensions to Big5 that predate GCCS and HKSCS, and that are
smaller in size (i.e., only the more frequently used characters),
include it as well: 0xFA5E in Dynalab HK A, and 0xFAD9 in one of
Monotype's extensions.

> I am wondering especially about the CJK characters in Extension B. We all
> know that the majority of them are rare, ancient or idiosyncratic
> characters, but I am not quite sure that this is true for *all* of them.

I probably wouldn't use "idiosyncratic" as an adjective to describe the
*majority* of them, but "rare" and "ancient" (perhaps "historical"[2]
would be a better word choice?) are correct.

[2] e.g., the "recently deceased", such as Vietnamese chu+~ no^m
characters in Plane 2, or even Deseret in Plane 1.

 
> I think that this is an important question for deciding whether an
> application should use 32 or 16 bit characters internally, and whether an
> application has to be fully UTF-16 aware or it can be "UTF-16 ignorant".
>
> E.g., imagine designing an application that will be localized in Cantonese:
> it is important to know whether all characters needed in Cantonese are in
> the BMP, or if some of them are in Extension B.

Some of them are in Extension B. HKSCS is unfortunately a mix of
characters needed in Cantonese, and characters needed in Hong Kong (the
two are not necessarily the same thing). Rather than trying to figure out
what all the characters used in writing Cantonese are, which is an
open-ended set, it is simpler to make the assumption that any characters
needed for Cantonese that are worth supporting have already made it into
HKSCS. Then make a decision based on whether one will support legacy data
from HKSCS. (In some cases, one does not have a decision to make, if it
is mandatory--e.g., a product that will be used by the HKSAR government.)

It doesn't even have to be an application localized into Cantonese;
just one that can process Cantonese text, e.g., for court
transcription purposes.
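
To make the "UTF-16 aware" question concrete, here is a minimal Python
sketch of what U+21075 costs a 16-bit application. It is only an
illustration: the helper names (to_surrogates, utf16_units,
has_supplementary) are made up for this example, and the arithmetic is
the standard UTF-16 surrogate split.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


def has_supplementary(text):
    """True if any character in text lies outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)


gan = chr(0x21075)
high, low = to_surrogates(0x21075)
print(hex(high), hex(low))          # 0xd844 0xdc75
print(len(gan), utf16_units(gan))   # 1 2
```

One character, two 16-bit units: a "UTF-16 ignorant" application that
counts, truncates, or indexes by code unit can split gan in half.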

It seems that current practice in software is to stuff the characters from
HKSCS into the BMP's Private Use Area (PUA), sans unification. Hopefully
this will only be a temporary phase.
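
For anyone who has to deal with data produced that way, a rough Python
sketch of how one might flag BMP private-use code points (U+E000 through
U+F8FF) in a string follows. The function name find_bmp_pua is
hypothetical, and the sample uses an arbitrary PUA code point, not an
actual HKSCS assignment; mapping such code points back to real
characters would need the vendor's own table, which is not shown here.

```python
# BMP Private Use Area: U+E000..U+F8FF inclusive.
BMP_PUA = range(0xE000, 0xF900)


def find_bmp_pua(text):
    """Return (index, 'U+XXXX') for each BMP private-use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in BMP_PUA
    ]


# U+E000 is an arbitrary PUA code point chosen only for illustration.
sample = "a\ue000b"
print(find_bmp_pua(sample))   # [(1, 'U+E000')]
```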

Thomas Chan
tc31@cornell.edu


