Re: CJK tags - Fish or cut bait

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jun 20 1997 - 19:37:29 EDT


Pete Resnick wrote:
>
> After Martin's message, I've decided it's high time to write this.
>
> I believe the Unicode Consortium needs to define some reserved characters
> for *script* tagging, at least for CJK. I believe that though true language
> tags have other purposes which may be useful, distinguishing CJK is the one
> important thing that is needed in international plain text, and ACAP could
> probably get away with just that.
>

First, I must say that this is the first more-or-less specific and
comprehensive engineering justification for language (or in this
case CJK script) tagging that I have seen posted in the last two
weeks on this topic. At last it is starting to make sense why there
is such an insistent claim that CJK tagging is required in ACAP. This
analysis of the Eudora requirements for supporting the Macintosh
script systems for CJK is quite clear.
 
[snip]

>
> There is a simple engineering solution: Use reserved characters in Unicode
> as script tags. If these things were standardized, then I could just pass
> the string unimpeded to the Unicode converter and it would "do the right
> thing" when it did the conversion to script codes. In the absence of a
> standard way of doing this, I can use private use zone characters, and then
> when the Unicode converter says, "I don't know what to do with this
> character", I can simply say, "This one means prefer Japanese" and it will
> again "do the right thing".

This does seem both a simple and correct engineering solution to me.
A Macintosh WorldScript compatibility problem for CJK can be addressed
by 3 or 4 (?) Macintosh private use characters which retain the required
information for roundtrip conversions in Eudora.
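As a rough sketch of what such a converter hook might look like: the scan
below treats a handful of private-use code points as "preferred script"
markers when converting unified CJK ideographs back to Mac script codes.
The specific code points and the function name are hypothetical placeholders,
not assigned or standardized values.

```python
# Sketch only: the PUA code points below are hypothetical assignments,
# chosen for illustration; nothing here is standardized.

PUA_SCRIPT_TAGS = {
    0xF8F0: "Japanese",            # prefer the Japanese script system
    0xF8F1: "TraditionalChinese",  # prefer Traditional Chinese
    0xF8F2: "SimplifiedChinese",   # prefer Simplified Chinese
    0xF8F3: "Korean",              # prefer Korean
}

def preferred_script(text, default="Japanese"):
    """Scan a Unicode string for a private-use script tag and return
    the script system the converter should prefer when mapping unified
    CJK ideographs; fall back to a default when no tag is present."""
    for ch in text:
        if ord(ch) in PUA_SCRIPT_TAGS:
            return PUA_SCRIPT_TAGS[ord(ch)]
    return default
```

A converter that "doesn't know what to do with" one of these characters
would consult this table, set its script preference, and drop the tag
from the converted output.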

>
> The purpose of Unicode was to give me a reasonable way to store and
> manipulate international plain text. Modern operating systems that don't
> use Unicode natively but do have international plain text are able to
> distinguish between CJK in order to provide input methods (among other
> functionality). The failure of Unicode to be able to distinguish CJK means
> that I *must* use some non-standard method to use Unicode as international
> plain text.

This last statement could just as well be reworded:

   The architecture of the Macintosh WorldScript support for CJK means
   that I *must* use some non-standard method to use Unicode as international
   plain text.

The right way to approach this in the Unicode Consortium (as an industry
consortium, after all), is for Apple to bring in a clearly stated
implementation requirement for these 3 or 4 (?) additional codes. John Jenkins
has already proposed something of the sort in UTC, for just this problem.
If Apple can then convince the rest of the industry, as represented in
the Consortium, that the problem these corporate user-defined characters
are addressing is widespread and common enough (and commensurate for each
participant) to justify giving standard codes to them, then UTC would vote
for them and work with WG2 to see they got into 10646. If Apple cannot
convince the rest of the industry, then they should stay private-use
codes, since they would then seem to be serving a particular manufacturer's
implementation, and there would be no agreement about their need for
interoperability.

>
> It's time to fish or cut bait. Either define some tags to disambiguate CJK
> in Unicode or I'm going to do it for you: I'm going to implement it in my
> code using some (non-interoperable) characters in the private use zone, and
> probably convince the ACAP people to do what I do, and then you're going to
> be stuck supporting these characters anyway.

You'll run into a rock wall of opposition from both the Unicode Consortium
and ISO if you pick some private-use characters and then try to force their
standardization, contrary to anybody else's use of private-use characters.
The ACAP people have also expressed a strong lack of interest in making use of
private-use characters for language tagging in the ACAP protocol, precisely
because of the interoperability and standardization problems.

There is, however, a middle way through this dilemma.

If the UTC accepts a low-impact means of conveying tagging in plain text
(such as the Plane 14 proposal), that will probably meet the ACAP
stated requirements for language (or other) tagging.

The Plane 14 proposal could be implemented directly to solve your
problem for CJK script tags in Eudora, but the length of the tags
might not be ideal for the kinds of fields you are talking about (e.g.
tagging all the Subject fields). In that case, you'd still be better
off implementing with a user-defined character. Then when the Macintosh
edit fields meet the ACAP protocol, you put in a trivial piece of
conversion code that converts your private tag to the form specified
for the ACAP (or other Internet protocol). Basically, it would convert:

   <Mac Script Japan Pref Tag Char> <==> <IETF Lang Start><J><A><-><J><P>

where the ACAP tag is entirely expressed in the Plane 14 characters set
aside for such tags (U-000E0000..U-000E007F).
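That trivial piece of conversion code can be sketched as follows, assuming
the Plane 14 scheme in which U+E0001 opens a language tag and each tag
character is the corresponding ASCII character offset by 0xE0000. The
Mac-side private tag code point (0xF8F0 here) is a hypothetical placeholder,
as is the exact tag text, since language tags are case-insensitive.

```python
# Sketch of the private-tag <==> Plane 14 tag conversion, assuming the
# Plane 14 layout described above. The private-use code point 0xF8F0
# ("prefer Japanese") is a hypothetical placeholder, not a real assignment.

LANG_TAG_START = 0xE0001   # assumed tag-introducer code point
TAG_BASE = 0xE0000         # tag char = ASCII char + TAG_BASE

MAC_JAPAN_PREF = 0xF8F0    # hypothetical private-use "prefer Japanese" tag

def to_plane14_tag(lang):
    """Encode an IETF-style language tag as a Plane 14 tag sequence."""
    return chr(LANG_TAG_START) + "".join(chr(TAG_BASE + ord(c)) for c in lang)

def private_to_acap(text):
    """Replace the private Mac tag character with its Plane 14 form."""
    return text.replace(chr(MAC_JAPAN_PREF), to_plane14_tag("ja-JP"))
```

Going the other direction is the same substitution in reverse, so the
roundtrip preserves the script-preference information on both sides.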

This gets ACAP a language tagging scheme. It gets Unicode a generic tagging
scheme with easily ignorable syntax for processes that don't want to see
tagging. And it lets you have your cake and eat it too for the Eudora
internationalization work.

--Ken

>
> Something needs to be done. It would be much better if the Consortium chose
> the codepoints to use rather than leave it to my whims.
>
> pr
>
> --
> Pete Resnick <mailto:presnick@qualcomm.com>
> QUALCOMM Incorporated
> Work: (217)337-6377 / Fax: (217)337-1980
>
>
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT