RE: UTF8 vs. Unicode (UTF16) in code

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Fri Mar 09 2001 - 18:41:06 EST


On Fri, 9 Mar 2001, Marco Cimarosti wrote:

> It is not very clear to me what is included in Extension B: how is it
> possible to know something more about it?

Look at DUTR #27[1] (2001.2.23), section 10.1, and see whether any of the
sources listed there contain characters that are important to you. E.g.,
CNS 11643-1992, HKSCS, and JIS X 0213 are probably relevant to someone out
there exchanging data in them. The G- sources pretty much round out the
entirety of characters that have been documented in Chinese dictionaries
(although there are omissions that remain unencoded to date).

[1] http://www.unicode.org/unicode/reports/tr27/

No figures are given there, though, so the Unicode pipeline[2] and IRG
N777[3] (2000.12.20) would give one a better idea of the relative
weight of each source that went into it. E.g., according to the latter
source, the 18th century _Siku Quanshu_ collectanea, despite its size,
contributes only 522 characters. (I presume this is after the _Kangxi
Zidian_ and _Hanyu Da Zidian_ dictionaries have already had their
monstrous five-digit shots at contribution.)

[2] http://www.unicode.org/unicode/alloc/Pipeline.html
[3] http://www.cse.cuhk.edu.hk/~irg/irg/N777_CJK_B_CoverNote.pdf

> But the discussion was about porting existing applications to Unicode for
> the purpose of being able to localize/use them in new markets.
> Imagine concrete cases. E.g., I do software for the retail industry.

Okay, retail... you mention receipts, but these ultimately are connected
with purchase orders, invoices, manifests, inventories, etc--am I right?

How about the case of a retailer who needs to deal with parts for
elevators and needs U+282E2, lip 'elevator'? Or neckties, requiring
U+27639, taai 'tie'.
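
To make concrete what these two code points cost an implementation, here
is a minimal sketch, assuming Python 3, of how a Plane 2 character turns
into a surrogate pair in UTF-16. The function name utf16_units is made up
for illustration; the arithmetic is the standard UTF-16 encoding rule.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if cp < 0x10000:
        # BMP characters fit in a single 16-bit unit.
        return [cp]
    # Supplementary characters: subtract 0x10000, then split the remaining
    # 20 bits into a high (lead) and a low (trail) surrogate.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


if __name__ == "__main__":
    for cp in (0x282E2, 0x27639):  # lip 'elevator', taai 'tie'
        units = " ".join("%04X" % u for u in utf16_units(cp))
        utf8_len = len(chr(cp).encode("utf-8"))
        print("U+%X -> UTF-16: %s, UTF-8: %d bytes" % (cp, units, utf8_len))
```

Each of the two characters takes two 16-bit units in UTF-16 and four
bytes in UTF-8, so code that assumes "one unit = one character" will
miscount both.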

One can make a special Big5-HKSCS port of the product just for the
Cantonese-speaking market, which will suffice for HK-internal usage, but
what happens when one has to interchange data with factories in mainland
China or southeast Asia, who don't use Big5-HKSCS (or even plain vanilla
Big5)? Not to mention the Cantonese market outside of HK, such as right
across the border in mainland China. I don't think you'd
want to maintain a (hypothetical) "Hong Kong" edition of your product as
well as a "Guangdong" edition, nor would your clients want to pay for and
use multiple incompatible systems. And even Big5-HKSCS is only a
temporary solution until Unicode (and software based on it) can meet the
same needs.

And what if, say, in the near future, you need to ship a Taiwanese-capable
(i.e., Southern Min) product? All that effort for Cantonese is not really
reusable, compared to the fruits of the initial pain and cost of
implementing non-BMP support.

But, to get away from Cantonese specifically for the moment, how about
something as simple as printing the names and addresses of customers,
factories, distributors, et al? This is a problem that afflicts Japanese
and Chinese (across the board). Wouldn't one have a mailing list or other
records with the names of these people and businesses, even if it doesn't
show up on a receipt? (But think how it would be appreciated if you did
print someone's name correctly!) If someone's or some place's name happens
to require a character from Plane 2, what are you going to do?

> My managers could come and ask me to localize our solution for a retailer,
> based in South China, who want their receipts and GUI messages to be in
> Cantonese.
> In *this* case I can push Unicode and fully justify the burden of UTF-16
> support and, especially, the burden of checking that all programmers in the
> team behave themselves with strings (e.g., they won't trim strings blindly,
> leaving a lonely high surrogate at the end of it).
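
The trimming hazard described in the quote can be sketched as follows,
assuming Python 3 and a string held as a list of UTF-16 code units. The
helper names to_units and truncate_utf16 are hypothetical, chosen only
for this example; this is one possible guard, not a complete
string-handling policy.

```python
import struct


def to_units(s):
    """Encode a Python string as a list of UTF-16 code units (ints)."""
    data = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def truncate_utf16(units, max_units):
    """Keep at most max_units code units without ending on half a pair."""
    cut = units[:max_units]
    # If the last unit kept is a high surrogate, its low surrogate was cut
    # off (or never existed), so drop it rather than leave it dangling.
    if cut and 0xD800 <= cut[-1] <= 0xDBFF:
        cut = cut[:-1]
    return cut


if __name__ == "__main__":
    units = to_units("A" + chr(0x282E2))   # one BMP char + one Plane 2 char
    print(["%04X" % u for u in truncate_utf16(units, 2)])
```

A blind units[:2] here would keep the "A" plus a lonely high surrogate;
the guarded version returns just the "A".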

I was wondering earlier what kind of Cantonese messages would appear on a
receipt or GUI. There is the issue that people who can read and write
Cantonese are diglossic: they also read and write the mainstream standard
written Chinese (based on Mandarin), which is understood by all schooled
Chinese. (This
situation is somewhat similar to that of people who are content to use the
US English version of a product rather than one in their own language.)
Written Cantonese has a stigma (albeit decreasing) of being "too
vernacular", and most of its uses are for transcription or depiction of
speech, such as a newspaper or magazine interview, movie scripts and
subtitles, fiction and comic books, print media ads, etc. But perhaps
the client would like their receipts to include a little slogan at the
bottom in Cantonese, specific to the particular business, to show that
they cater to the customer's specific needs. Cf. how an
exampleretailstore.ca or exampleretailstore.co.uk domain makes clear that
the business does (or implies that it knows how to do) local business,
unlike an ambiguous or generic exampleretailstore.com.

> But you can imagine how winning would be the argument of UTF-16 for printing
> pentagrams or on receipts (or algebraic formulae, or an aborted orthographic
> for English, or the script used in Viet-Nam centuries ago)...

Pentagrams? I haven't seen those... where are they?

Thomas Chan
tc31@cornell.edu


