CJK tags - Fish or cut bait

From: Pete Resnick (presnick@qualcomm.com)
Date: Fri Jun 20 1997 - 17:11:23 EDT

After Martin's message, I've decided it's high time to write this.

I believe the Unicode Consortium needs to define some reserved characters
for *script* tagging, at least for CJK. I believe that though true language
tags have other useful purposes, distinguishing CJK is the one thing that
is genuinely needed in international plain text, and ACAP could probably
get away with just that.

Let me tell a little story:

As you may know, I am working on the internationalization of Eudora (among
other things) on the Macintosh. Let us start with the following working
assumptions:
1. The Macintosh WorldScript Manager gives me the ability to let the user
edit multiple-script plain text. Though encoded at the low-level using font
information on the Mac, there is a way to specify "default font for this
script", thus leaving me with multi-script runs of plain text. There is no
style information other than the script tag.

2. The script information is important to the user, not for display in the
"appropriate" font (which is nice, but not completely vital), but as an
absolute requirement to choose the correct input methods. It would be
unacceptable to have the user edit Japanese text using a Chinese input
method. On the Macintosh, you can have Japanese and Chinese input methods
installed at the same time, and the UI automatically chooses which one to
use based on the script of the text. (Remember, the script of the text was
predetermined either by the input method the user originally chose to type
it with, or by some external information.)

3. Eudora has several fields that allow both display and editing of
multi-script plain text. This includes message Subject: fields, the real
name portion of message address fields (which appear not only in messages,
but in Address Book entries), the Search Criteria line in the Find dialog,
as well as some other editable dialog items. (Note that because such text
can occur in Address Books and other settings fields, it means it can also
occur in ACAP.)

4. The Macintosh also provides me with a Unicode converter. It is able to
translate from Macintosh script runs (as well as all of the Internet
encodings like ISO-2022-JP and ISO-8859-x) to Unicode, and convert from Unicode
back to script runs.
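
To make the model concrete, here is a minimal sketch of the script-run idea, with Python's codec machinery standing in for the Macintosh Unicode converter. The run structure and the script names are hypothetical illustrations, not Apple's actual WorldScript types:

```python
# A sketch of "multi-script plain text" as runs of bytes, each run
# carrying only a script tag. Python codecs play the role of the
# Macintosh Unicode converter; the names below are made up for
# illustration.

from dataclasses import dataclass

# Hypothetical script tags, loosely mirroring Mac script codes.
SCRIPT_ROMAN = "Roman"
SCRIPT_JAPANESE = "Japanese"
SCRIPT_SIMP_CHINESE = "SimpChinese"

# Each script implies a legacy encoding for its byte runs.
RUN_ENCODING = {
    SCRIPT_ROMAN: "mac_roman",
    SCRIPT_JAPANESE: "shift_jis",
    SCRIPT_SIMP_CHINESE: "gb2312",
}

@dataclass
class ScriptRun:
    script: str   # the only "style" information carried
    data: bytes   # text encoded in that script's legacy encoding

def runs_to_unicode(runs):
    """Convert script runs to one Unicode string. This is the lossy
    direction: the per-run script tag disappears."""
    return "".join(r.data.decode(RUN_ENCODING[r.script]) for r in runs)

runs = [
    ScriptRun(SCRIPT_ROMAN, b"Subject: "),
    ScriptRun(SCRIPT_JAPANESE, "\u65e5\u672c\u8a9e".encode("shift_jis")),
]
print(runs_to_unicode(runs))  # "Subject: " followed by the Japanese text
```

Note that the reverse direction (Unicode back to script runs) is exactly where the trouble described below arises: nothing in the Unicode string says which script each run should get.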

OK, so where does that leave me? I want to use something as my storage
format so that I can have multi-script plain text without having to store
external script information or complex inline script information. Storing
external script information is too burdensome; keeping track of extra files
and/or data structures for simple strings of text is an unreasonable hit,
and would require a redesign of certain parts of Eudora. Complex inline
tagging (be it MIME character set tagging or HTML-like tagging) requires
that I strip this stuff as I display it; I'd have to deal with escape
sequences of some sort (for the "<" or "=" characters, or otherwise); and it
increases storage requirements by a huge factor in, for example, the list
of subjects in a mailbox (which might all have to be tagged). The best
solution is clearly Unicode.

Only I've got a problem with Unicode: CJK distinction. When I convert to
Unicode, no other important information is lost for my UI except the CJK
distinction. All other information is either preserved roundtrip, or is
unimportant to the user in order to edit multi-script plain text in the
correct input method. But the loss of the CJK distinction is unacceptable.

External profiling is way too burdensome, as I mentioned above. Complex
inline tagging is completely unacceptable: I don't want to have any
knowledge of Unicode per se; I want to pass a string of Unicode to the
Macintosh Unicode conversion routines, say "Convert this stuff to script
runs", and insert it into the edit field. Parsing out inline tags is just
not in the cards.

There is a simple engineering solution: Use reserved characters in Unicode
as script tags. If these things were standardized, then I could just pass
the string unimpeded to the Unicode converter and it would "do the right
thing" when it did the conversion to script codes. In the absence of a
standard way of doing this, I can use private use zone characters, and then
when the Unicode converter says, "I don't know what to do with this
character", I can simply say, "This one means prefer Japanese" and it will
again "do the right thing".
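
Here is a sketch of how that private-use-zone scheme might look. The tag code points below are arbitrary illustrative picks from the PUA, not anything standardized (which is precisely the problem):

```python
# A sketch of the private-use-zone workaround: prefix each run with a
# reserved tag character that a converter could interpret as "prefer
# this script". The code points are arbitrary, non-interoperable picks.

TAG_JAPANESE = "\ue001"   # hypothetical: "prefer Japanese"
TAG_CHINESE = "\ue002"    # hypothetical: "prefer Chinese"
TAGS = {TAG_JAPANESE: "Japanese", TAG_CHINESE: "Chinese"}

def split_tagged(text, default="Roman"):
    """Split a tagged Unicode string into (preferred_script, text) runs,
    stripping the tag characters as it goes."""
    runs, script, buf = [], default, []
    for ch in text:
        if ch in TAGS:
            if buf:
                runs.append((script, "".join(buf)))
                buf = []
            script = TAGS[ch]
        else:
            buf.append(ch)
    if buf:
        runs.append((script, "".join(buf)))
    return runs

tagged = "Re: " + TAG_JAPANESE + "\u4e00\u89a7" + TAG_CHINESE + "\u4e00"
print(split_tagged(tagged))
# [('Roman', 'Re: '), ('Japanese', '一覧'), ('Chinese', '一')]
```

If the Consortium standardized such tag characters, this splitting could live inside the system converter itself, and applications could pass strings through untouched.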

The purpose of Unicode was to give me a reasonable way to store and
manipulate international plain text. Modern operating systems that don't
use Unicode natively but do have international plain text are able to
distinguish between CJK in order to provide input methods (among other
functionality). The failure of Unicode to be able to distinguish CJK means
that I *must* use some non-standard method to use Unicode as international
plain text.

It's time to fish or cut bait. Either define some tags to disambiguate CJK
in Unicode or I'm going to do it for you: I'm going to implement it in my
code using some (non-interoperable) characters in the private use zone, and
probably convince the ACAP people to do what I do, and then you're going
to be stuck supporting these characters anyway. Eudora has a large number
of users, and if we start sticking this stuff out on ACAP servers in
address book entries, you're going to start to have these things flying
around the net. No, I have no intention
of ever sending these things in e-mail messages (I'll use MIME encodings
with either HTML or text/enriched in the bodies, and language tagging or
other charsets in the headers), but eventually someone who is not as
enlightened as I am is going to let them leak out to the net.

Something needs to be done. It would be much better if the Consortium chose
the codepoints to use rather than leave it to my whims.


Pete Resnick <mailto:presnick@qualcomm.com>
QUALCOMM Incorporated
Work: (217)337-6377 / Fax: (217)337-1980

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT