RE: UTF8 vs. Unicode (UTF16) in code

From: Thomas Chan
Date: Mon Mar 12 2001 - 11:55:42 EST

On Mon, 12 Mar 2001, Marco Cimarosti wrote:

> Thomas Chan wrote:
> > How about the case of a retailer who needs to deal with parts for
> > elevators and needs U+282E2, lip 'elevator'? Or neckties, requiring
> > U+27639, taai 'tie'.
> I am not seeking excuses to not implement UTF-16 -- rather examples of
> characters that *do* justify it.

I did not mean to imply that you were looking for excuses; sorry if it
came across that way. However, some people have potentially legitimate
reasons to conserve costs and resources by implementing a subset; e.g.,
the "CJK Unified Ideographs" block in the BMP is one of the first things
to go when people want to make fonts lightweight, or do not want to draw
all those glyphs or lack the expertise to do so. If the perception is
that effort for supplementary characters is only for
"rare/obscure/historic CJKV" (which is admittedly true for most
supplementary characters at the moment), then some people might not bother
with support for surrogates, UTF-16, etc. (And Plane 1 users like
musicians, mathematicians, and LDS would be "hurt" in the crossfire, too,
since it is an "all-or-nothing" matter.)
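
To make concrete what "support for surrogates" involves, here is a minimal
Python sketch of the UTF-16 surrogate-pair arithmetic, applied to the two
supplementary code points mentioned earlier in this thread. The function
name to_utf16_code_units is hypothetical and chosen only for illustration;
real software would normally rely on its platform's own encoder.

```python
from typing import List


def to_utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units for one Unicode code point.

    Illustrative helper (not from the original post): BMP code points map
    to a single 16-bit unit, supplementary ones to a surrogate pair.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point < 0x10000:
        return [code_point]  # BMP: one code unit
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


if __name__ == "__main__":
    # U+282E2 'lip', U+27639 'taai', and the BMP character U+8864
    for cp in (0x282E2, 0x27639, 0x8864):
        units = " ".join(f"{u:04X}" for u in to_utf16_code_units(cp))
        utf8_len = len(chr(cp).encode("utf-8"))
        print(f"U+{cp:04X}: UTF-16 = {units}; UTF-8 = {utf8_len} bytes")
```

Both U+282E2 and U+27639 need two 16-bit code units in UTF-16 and four
bytes in UTF-8, whereas a BMP character such as U+8864 fits in a single
code unit. Code that assumes "one 16-bit unit per character" will therefore
mishandle exactly these characters.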

> And all your examples are perfectly valid: it would be crazy to tell users:
> "Sorry: because of software limitations, you cannot order ties or
> elevators".

I neglected to describe U+27639, taai 'tie'. It looks like a
left-to-right horizontal arrangement of U+8864 U+592A.

> <OT>
> Out of curiosity, are these loanwords from English? Or is it just a
> coincidence that they sound like "lift" and "tie"?
> </OT>

I don't think your question is entirely off-topic. Loanwords are one way
for a language to gain new words and morphemes, some of which will be
assimilated enough to eventually find a written representation. Cantonese
is liberal enough to accept loanwords (rather than preferring calques made
of native morphemes, as Mandarin does), but conservative enough to prefer
writing in Han characters (rather than romanization, as preferred for
Southern Min in some quarters). That means new characters will be invented
for some of these new words (when existing characters are not reused, as
in diksi 'taxi' U+7684 U+58EB), which results in new candidates to be
added to Unicode.

To answer your question, yes, "lip" and "taai" are loanwords from
(British) English <lift> and <tie>. See for instance SUN Zehua's
"Xianggang de wailaici" (English Loanwords in Hong Kong)[1] for more

[1] The page is in Big5, and
written within the limitations of pure Big5; e.g., it uses the graph U+6064
for seut 'shirt', instead of the preferable (and more recent) U+88C7. That
is, the emphasis of Sun's page is on loanwords, not orthography.

> However, I guess that Cantonese speakers might use dialectal terms (like
> "lip" and "taai" above) even when writing in literary Mandarin. And
> certainly they would not Mandarinize proper names.

Yes, as the written language is not the spoken language, "errors" do show
up, making the text less universally intelligible. In addition, there is
a register difference between the Mandarinesque singgonggei 'elevator'
U+5347 U+964D U+6A5F and the above-mentioned "lip" that a writer may wish
to make use of.

> > Pentagrams? I haven't seen those... where are they?
> Hmmm... This is possibly an Italian word badly Anglicized. I just meant
> "musical notation".

Okay. I thought perhaps there were additions to "Misc Symbols" U+2600 ..
U+267F or elsewhere that I had missed.

Thomas Chan
