CJK tags - Prepare for the future (was: Fish or cut bait)

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Fri Jun 27 1997 - 07:23:55 EDT

On Fri, 20 Jun 1997, Pete Resnik wrote:

> After Martin's message, I've decided it's high time to write this.

Hello Pete - I don't know which of my messages triggered this,
but anyway, I'm happy it did, as it presents some interesting
new aspects in the discussion. This answer may come a little
late, as things have already progressed much between you and
John, but I hope it is still useful.

> I believe the Unicode Consortium needs to define some reserved characters
> for *script* tagging, at least for CJK.

I'm glad you didn't refer to CJK glyph disambiguation, but to input
switching problems.
I'll use the term "Mac script" below to refer to your use of script.

> 2. The script information is important to the user, not for display in the
> "appropriate" font (which is nice, but not completely vital), but as an
> absolute requirement to choose the correct input methods. It would be
> unacceptable to have the user edit Japanese text using a Chinese input
> method. On the Macintosh, you can have Japanese and Chinese input methods
> installed at the same time, and the UI automatically chooses which one to
> use based on the script of the text. (Remember, the script of the text was
> predetermined by either what input method the user chose to type it in
> originally, or from some external information).

I am very familiar with the feature that conforming Mac applications
automatically change the input method if the cursor is moved to a
different place (with a different Mac script). Based on my personal
experience, I'm not at all convinced that this is the right feature.
These are my reasons:

- The user is not in control. The system is second-guessing.
- It does not work that way for subsets of a script (for consistency,
        I might expect that the keyboard switches to Katakana if I
        move the cursor to a place with Katakana characters, but it
        only switches to Japanese, with whatever "submode" I happen
        to have used last).
- It does not work that way for pieces of Roman text that are
        encoded as part of another Mac script. (In my case, this
        is particularly annoying because I have set the Finder to
        use a Japanese font, but most of my filenames are Latin
        and I don't want to use the Latin submode of Japanese
        because it is unable to handle Dvorak correctly.)
- It does not work nicely for repeated insert/delete. E.g. you
        start to insert something in script A into script B,
        then correct it up to the point that it vanishes completely,
        and you have to reset your input method.

There are also a few points that make switching of input methods
more clumsy than it could be:

- Command-space gets you through the selected input method of each
        script. But it is impossible to switch off a script unless
        you deinstall it. (This is particularly annoying in cases
        where you want to have a lot of scripts available for
        instruction, debugging, and reference, but rarely use most
        of them.) Somebody at Apple has told me that a fix for
        this is being considered.
- When you switch through input methods, you have to wait for each
        method to get started up before it switches to the next,
        even if you never intend to use that method. CJK
        dictionary-based methods take a lot of time to start up.

I would therefore prefer an application that would give me
full control over the selection of the input method at any place.

> 3. Eudora has several fields that allow both display and editing of
> multi-script plain text. This includes message Subject: fields, the real
> name portion of message address fields (which appear not only in messages,
> but in Address Book entries), the Search Criteria line in the Find dialog,
> as well as some other editable dialog items. (Note that because such text
> can occur in Address Books and other settings fields, it means it can also
> occur in ACAP.)

Not at all. Such functionality is great. It already exists in some
programs, and we need more of it!

> 4. The Macintosh also provides me with a Unicode converter. It is able to
> translate from Macintosh script runs (as well as all of the Internet
> encodings like 2022-JP and 8859-x) to Unicode, and convert from Unicode
> back to script runs.
> OK, so where does that leave me: I want to use something as my storage
> format so that I can have multi-script plain text without having to store
> external script information or complex inline script information.

If you are using Unicode as a storage format, so that for example
search finds the same characters in various Mac scripts, users will
slowly but steadily start to have their distinction between the
Mac scripts blurred. If the fonts used for the various scripts
differ strongly (the most visible difference is actual size at a
given point size and boldness, glyph variant differences are in
many cases much less visible), this may take some more time.
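
To make the search point concrete, here is a small Python sketch.
It is only an illustration under my own assumptions: Python's
"shift_jis", "gb2312" and "big5" codecs stand in for the Japanese,
Simplified Chinese and Traditional Chinese Mac scripts, and the
helper name is hypothetical. It has nothing to do with the actual
Mac Unicode converter API.

```python
# Stand-ins (assumption) for three CJK Mac scripts: legacy encodings
# that each have their own byte sequence for the same Han character.
LEGACY_CODECS = ["shift_jis", "gb2312", "big5"]


def to_unicode(raw: bytes, codec: str) -> str:
    """Convert a legacy-encoded run to Unicode (hypothetical helper)."""
    return raw.decode(codec)


ch = "\u4e2d"  # a Han character present in all three legacy sets

# The same character as it would be stored under each legacy encoding.
encoded = {codec: ch.encode(codec) for codec in LEGACY_CODECS}

# After conversion to Unicode, the three runs become one and the same string.
decoded = {codec: to_unicode(raw, codec) for codec, raw in encoded.items()}

distinct_before = len(set(encoded.values())) == len(LEGACY_CODECS)
same_after_conversion = len(set(decoded.values())) == 1
```

Once the text is stored this way, a search for the character matches
all three runs, which is exactly why the script of origin stops being
visible to the user.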

Users will then start to use a Chinese input method to input
a Japanese name and vice versa. For example, a Chinese who
reads Japanese to some degree and wants to record a Japanese
name in her address book will prefer to input a character
of which she doesn't know the Japanese pronunciation with
a Chinese input method (either simplified or traditional)
rather than having to scan through the Radical-Stroke panel
provided for Japanese (the input methods I have for Chinese
unfortunately don't even provide such a Radical-Stroke panel).
This is definitely not very far-fetched; even currently for
Japanese, all kinds of such tricks are used to speed up the
input of rare or not-very-well-known characters. The relation
between the input method used and the language of the text input
starts to deteriorate.

As a consequence of this, relying on a strict separation of
CJK (actually GTJK) Mac Scripts may become counterproductive
at some point. For example, consider the conversion from a
name input as described above to a MIME mail header. The
straightforward approach for this is to take the individual
segments of the name in each Mac Script and convert them to
the appropriate "charset". [here you get into problems with
MIME, because encoded words in headers are words in the western
sense, i.e. separated by spaces]. Most traditional software
will not be able to handle this. The alternative is to not
rely too much on Script tagging, and to decide on the best
encoding (GB only or Big5 only or JIS only,...) based on
information about the receiver (for a nice example of such
an approach, see the Lys MUA, a prototype for SunOS, at
ftp://ftp.crim.ca/pub/Lys/Lys_SunOS.tar.gz) or based on
heuristic lookahead (which encoding the sequence of Unicode
CJK characters most probably belongs to).
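
A minimal Python sketch of the second alternative, under stated
assumptions: the candidate list and its order are hypothetical (in
practice they would come from what is known about the receiver), the
function names are mine, and falling back to UTF-8 is my addition.
It tries each charset on the whole name and uses the first one that
can represent every character, instead of splitting the name by Mac
script.

```python
from email.header import Header

# Hypothetical preference order; ideally derived from information
# about the receiver rather than hard-coded.
CANDIDATE_CHARSETS = ["iso-2022-jp", "gb2312", "big5"]


def pick_charset(text, preferred=CANDIDATE_CHARSETS):
    """Return the first charset that can encode the whole text."""
    for charset in preferred:
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    # Assumption: fall back to UTF-8 when no single legacy charset fits.
    return "utf-8"


def encode_display_name(text, preferred=CANDIDATE_CHARSETS):
    """Encode a name as an RFC 2047 header value in one charset."""
    charset = pick_charset(text, preferred)
    return Header(text, charset).encode()


# A Japanese name fits the first candidate; a character that exists only
# in the simplified Chinese set forces the choice to move on.
japanese_choice = pick_charset("\u5c71\u7530")
simplified_choice = pick_charset("\u4eec")
```

Passing a different `preferred` list per receiver gives the
receiver-based variant; the loop itself is a crude form of the
lookahead, since it inspects the whole string before committing.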

This change of model from a strict separation of various
Mac scripts to a closer integration is something that
Apple is probably also looking into, or should look into,
and that is important to keep in mind for long-living

> Storing
> external script information is too burdensome; keeping track of extra files
> and/or data structures for simple strings of text is an unreasonable hit,
> and would require a redesign of certain parts of Eudora. Complex inline
> tagging (be it MIME character set tagging or HTML-like tagging) requires
> that I strip this stuff as I display it, I'll have to deal with escape
> sequences of some sort (for the "<" or "=" characters or otherwise) and it
> increases storage requirements by a huge factor in, for example, the list
> of subjects in a mailbox (which might all have to be tagged).

I'm glad you mention length. I think this is a problem of the current
plane-14 proposal. Some people are not worried about this, but others are.

> Only I've got a problem with Unicode: CJK distinction. When I convert to
> Unicode, no other important information is lost for my UI except the CJK
> distinction. All other information is either preserved roundtrip, or is
> unimportant to the user in order to edit multi-script plain text in the
> correct input method. But the loss of the CJK distinction is unacceptable.

Given the above explanations, I'm not sure whether this is true.
There is clearly a need for a consistent user interface, and with
the current functionality on the Mac, and according to the Mac
implementations and guidelines, this is the clear separation
of the various Mac scripts for CJK. However, the more Unicode
functionality you add, the less this model will work consistently
for the user. Transition from one model to another is difficult,
but it is better to think about it early on than to try to stick
to an old model too long.

Regards, Martin.
