Re: CJK tags - Fish or cut bait

From: Pete Resnick
Date: Sat Jun 21 1997 - 13:00:51 EDT

On 6/20/97 at 6:37 PM -0500, Kenneth Whistler wrote:

>First, I must say that this is the first more-or-less specific and
>comprehensive engineering justification for language (or in this
>case CJK script) tagging that I have seen posted in the last two
>weeks on this topic. At last it is starting to make sense why there
>is such an insistent claim that CJK tagging is required in ACAP. This
>analysis of the Eudora requirements for supporting the Macintosh
>script systems for CJK is quite clear.

Thank you.

>> There is a simple engineering solution: Use reserved characters in Unicode
>> as script tags.
>This does seem both a simple and correct engineering solution to me.
>A Macintosh WorldScript compatibility problem for CJK can be addressed
>by 3 or 4 (?) Macintosh private use characters which retain the required
>information for roundtrip conversions in Eudora.

Actually, I've got 5 defined: "Japanese", "Korean", "Trad. Chinese", "Simp.
Chinese", and "unspecified" (just in case I want to turn off the preference
and let the user's system make the decision).
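
A minimal sketch of how five such in-band script tags could be consumed, assuming hypothetical private-use code points U+E000 through U+E004 (the message does not say which code points are actually used) and assuming a tag applies to all text up to the next tag:

```python
# Hypothetical private-use code points; the real values are not given in the message.
SCRIPT_TAGS = {
    "\uE000": "Japanese",
    "\uE001": "Korean",
    "\uE002": "Trad. Chinese",
    "\uE003": "Simp. Chinese",
    "\uE004": "unspecified",
}


def split_tagged_runs(text, default="unspecified"):
    """Split text into (script, run) pairs; each tag applies until the next tag."""
    runs = []
    script = default
    current = []
    for ch in text:
        if ch in SCRIPT_TAGS:
            # A tag character ends the current run and switches the script.
            if current:
                runs.append((script, "".join(current)))
                current = []
            script = SCRIPT_TAGS[ch]
        else:
            current.append(ch)
    if current:
        runs.append((script, "".join(current)))
    return runs
```

An editor could then hand each run to the input method or font for its script, and fall back to the system's choice for "unspecified" runs.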

>> The purpose of Unicode was to give me a reasonable way to store and
>> manipulate international plain text. Modern operating systems that don't
>> use Unicode natively but do have international plain text are able to
>> distinguish between CJK in order to provide input methods (among other
>> functionality). The failure of Unicode to be able to distinguish CJK means
>> that I *must* use some non-standard method to use Unicode as international
>> plain text.
>This last statement could just as well be reworded:
> The architecture of the Macintosh WorldScript support for CJK means
> that I *must* use some non-standard method to use Unicode as international
> plain text.

I don't necessarily see how any OS (Macintosh or otherwise) could possibly
deal with 16-bit Unicode as editable plain text without some sort of CJK
distinction. We all have to deal with input methods at some point.

>The right way to approach this in the Unicode Consortium (as an industry
>consortium, after all), is for Apple to bring in a clearly stated
>implementation requirement for these 3 or 4 (?) additional codes. John Jenkins
>has already proposed something of the sort in UTC, for just this problem.
>If Apple can then convince the rest of the industry, as represented in
>the Consortium, that the problem these corporate user-defined characters
>are addressing is widespread and common enough (and commensurate for each
>participant) to justify giving standard codes to them, then UTC would vote
>for them and work with WG2 to see they got into 10646.

I will talk to John about this, but it seems like the logical thing for the
UTC to do.

>> It's time to fish or cut bait. Either define some tags to disambiguate CJK
>> in Unicode or I'm going to do it for you: I'm going to implement it in my
>> code using some (non-interoperable) characters in the private use zone, and
>> probably convince the ACAP people to do what I do, and then you're going to
>> be stuck supporting these characters anyway.
>You'll run into a rock wall of opposition from both the Unicode Consortium
>and ISO if you pick some private-use characters and then try to force their
>standardization, contrary to anybody else's use of private-use characters.

Oh, absolutely. My point was that if I (or the ACAP people, or Apple) go
ahead and choose my own characters and they start getting widespread use,
the industry is going to eventually end up with a de facto standard to do
this. Then, if the UTC eventually gets around to thinking this is a good
idea and coming up with some other standard, you'll either have two
non-interoperable standards, or the one that UTC comes up with won't be
used because the de facto method will be ubiquitous.

>The ACAP people have also expressed a sharp disinterest in making use of
>private-use characters for language tagging in the ACAP protocol, precisely
>because of the interoperability and standardization problems.

Right. That's why I want the UTC to do the standards work.

>There is, however, a middle way through this dilemma.
>If the UTC accepts a low-impact means of conveying tagging in plain text
>(such as the Plane 14 proposal), that will probably meet the ACAP
>stated requirements for language (or other) tagging.

I've only got one big problem with the Plane 14 proposal: the Plane. The
Unicode implementation on the Macintosh (and I'm guessing on lots of other
platforms) only does good old-fashioned 16-bit Unicode. A tagging scheme
that requires 32-bit characters is, for all intents and purposes, useless
to me.
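
To illustrate the size problem, here is a rough sketch of a 16-bit-only store that keeps one character per 16-bit unit. It is not the Macintosh converter; the function name is hypothetical, and U+E0001 is assumed as a representative Plane 14 tag character:

```python
import array


def to_16bit_units(text):
    """Store text as one 16-bit unit per character, rejecting anything larger."""
    units = array.array("H")  # unsigned 16-bit units
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Plane 14 code points start at U+E0000, well above U+FFFF.
            raise ValueError(f"U+{cp:04X} does not fit in 16 bits")
        units.append(cp)
    return units
```

Under these assumptions, ordinary CJK text goes through unchanged, while `to_16bit_units(chr(0xE0001))` raises `ValueError`.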

Now, that said:

>The Plane 14 proposal could be implemented directly to solve your
>problem for CJK script tags in Eudora, but the length of the tags
>might not be ideal for the kinds of fields you are talking about (e.g.
>tagging all the Subject fields). In that case, you'd still be better
>off implementing with a user-defined character.

Right. I have no objection to the Plane 14 proposal being used for
full-fledged language tagging. At that point, what you're saying is, "In
order to get full language tagging, as necessary for spell checking,
sorting, text-to-speech, etc., you're just going to have to be able to deal
with 32-bit characters." That's OK with me. I don't need full language
tagging in order to implement simple international plain text editing.
All I need for that is simple CJK tags, and I want those to be in the
16-bit space, and standardized so more than just Eudora (or Macintosh
people) can use them.

>Then when the Macintosh
>edit fields meet the ACAP protocol, you put in a trivial piece of
>conversion code that converts your private tag to the form specified
>for the ACAP (or other Internet protocol).

OK, so there is an open question whether I will be required to deal with
Plane 14 language tags in order to use ACAP. If ACAP requires language
tags, and uses Plane 14, I'm going to have to deal with those tags.
However, as I stated in my original message, I'm not convinced that ACAP
really needs honest-to-god language tags and probably could do with just the
CJK tags.
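
The "trivial piece of conversion code" mentioned in the quoted text might look roughly like the sketch below. It rests on several assumptions: the private-use code points are hypothetical, the mapping from script tags to language codes is a guess made for illustration, and the Plane 14 form shown (U+E0001 followed by tag characters at U+E0000 plus the ASCII value) is the layout later published for Unicode tag characters, which may differ from the proposal under discussion:

```python
# Hypothetical private-use script tags mapped to assumed language codes.
PRIVATE_TO_LANG = {
    "\uE000": "ja",
    "\uE001": "ko",
    "\uE002": "zh-TW",
    "\uE003": "zh-CN",
}
UNSPECIFIED_TAG = "\uE004"  # hypothetical; dropped on conversion in this sketch

LANGUAGE_TAG = 0xE0001  # tag introducer
TAG_BASE = 0xE0000      # tag characters mirror ASCII at this offset


def plane14_tag(lang):
    """Spell a language code as a Plane 14 tag sequence."""
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def private_to_plane14(text):
    """Replace each private script tag with the equivalent Plane 14 tag."""
    out = []
    for ch in text:
        if ch in PRIVATE_TO_LANG:
            out.append(plane14_tag(PRIVATE_TO_LANG[ch]))
        elif ch != UNSPECIFIED_TAG:
            out.append(ch)
    return "".join(out)
```

Note that the output contains code points above U+FFFF, so the conversion only helps on the protocol side; the result no longer fits a 16-bit-only edit field.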

Personally, I don't think ACAP is going to pass the IETF "interoperable
implementations" test if they require Plane 14 language tags, or any 32-bit
10646 for that matter. I'm pretty sure the Macintosh Unicode converter (in
its current incarnation) would spit back error messages on 32-bit
characters at this point, and I'll bet there are other Unicode implementations
that do the same thing. But that's an issue for the ACAP group and the
IETF. I'm not saying language tags are a bad idea, nor am I saying that
standardizing them in Plane 14 wouldn't be a fine thing. I'm just not
convinced that ACAP (or Eudora) will be using them any time soon.

I urge the UTC to adopt some set of characters for CJK distinction within
16-bit Unicode.


Pete Resnick
QUALCOMM Incorporated
Work: (217)337-6377 / Fax: (217)337-1980
