Re: UTF-8 reg tags...

From: Francois Yergeau (yergeau@alis.com)
Date: Wed Sep 11 1996 - 01:25:24 EDT

Next message: Jonathan Rosenne: "Re: UTF-8 reg tags..."
Previous message: John H. Jenkins: "Re: UniHan CDROM database"
Maybe in reply to: Tex Texin: "UTF-8 reg tags..."
Next in thread: Jonathan Rosenne: "Re: UTF-8 reg tags..."

At 15:34 10-09-96 -0700, Rick McGowan wrote:
>Maybe I'm coming into the middle of something and don't understand but...
>David Goldsmith said...
>
>> Actually, the internet draft that is being submitted to register UTF-8
>
>WHO SUBMITTED IT?

I submitted it. The first version went out in March, and a revision was
issued in August.

> Was the actual tag discussed with the UTC?

I have exchanged some mail with Gary Roberts, who claimed to be reviewing
the draft on behalf of the Unicode Consortium. He did not raise the actual
tag issue, so I assumed that the UTC did not have a problem with it.

The tag or tags to be registered were discussed at some length on the
ISO10646 list while preparing the draft. At one point there were some 7 tags
to distinguish 10646 implementation levels, subsets (BMP only, BMP + UTF-16,
full UCS-4) and versions. I asked whether there was a consensus for one tag
or many, and it turned out that most people preferred the single UTF-8 tag,
and gave good reasons for it.

>> This is what's going to be registered with IANA. I did try to talk them
>>into following the pattern previously established (and set down in RFC 1641)

This may be why Harald Alvestrand, on behalf of the IESG, raised the issue
with me in August. Here are his questions and my answers, which neatly
summarize the case for a single tag:

-------------------------------------------------------------------
>Francois,
>one question about your UTF-8 doc:
>Given the changing nature of 10646, which I know you know all about,
>how is a recipient to determine what version of 10646 may be expected
>in the body of the message?

There's no way. The issue was raised on the ISO10646 list when the draft
was discussed there, but the consensus was that it was not worth giving this
info. "charset=UTF-8" tells you how to decode the byte stream, but there is
no real benefit in knowing in advance what characters it will contain.
Either you can deal with newly assigned characters, or you cannot, but
knowing from the charset parameter doesn't help you.
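
As a minimal sketch of that point, here is how a receiver might go from the
charset parameter to decoded text. It uses Python's standard email package;
the sample message is made up for illustration and is not taken from the
draft.

```python
from email import message_from_bytes

# Hypothetical message, invented for this example.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=UTF-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Montr\xc3\xa9al\r\n"
)

msg = message_from_bytes(raw)

# The charset parameter is the only thing consulted to pick a decoder.
charset = msg.get_content_charset()

# Bytes on the wire become characters; no version information is involved.
body = msg.get_payload(decode=True).decode(charset)
```

Nothing in this path asks which edition of 10646 the sender had in mind; the
tag selects the byte-to-character mapping and nothing else.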

>In particular, what about the Hangul Jamo? [sic]

In that case it would help (be necessary, in fact), but the only reason the
Unicode and ISO committees could accept this incompatible change was that it
was demonstrated that the original code positions were absolutely not used
anywhere. The backward compatibility problem is thus purely virtual.

>Or anything else that ISO might dream up in the future?

For what it's worth, the relevant committees have pledged not to enact any
incompatible changes in the future.

>Kind of argues for the "charset-edition" extra parameter that was
>documented in RFC 1922....most people will NOT want to care....

Agreed, it's not worth the trouble. Furthermore, registering something like
"UTF-8-pre-AM5" would solve the Korean problem if it ever showed up on the
Internet.

>Anyway, I've passed it for publication.
-------------------------------------------------------------------

So the draft was approved and should be given an RFC number soon. But the
RFC will be Informational (as it should be). It only suggests a
registration, which is not done yet as far as I know; if the UTC really
objects to this single registration, there is probably still time to block
it.

Better consider the issue carefully, though. A MIME charset is designed to
give just the information needed to interpret a sequence of bytes received
on the wire into a sequence of characters, nothing more. As long as a
standard does not change incompatibly, version numbers serve no purpose,
because you gain nothing by learning *from the tag* that you're going to
receive newly assigned characters that you don't know about. You won't
learn anything about the new characters from the tag, and you're going to
receive them anyway. Once your software is updated to the latest version of
the standard, it can forget about the old one since it's compatible.
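
To illustrate, a small sketch in Python: the decoder recovers code points
whether or not the local character database knows them. The helper name
describe() and the choice of U+40000 as a stand-in for "a character this
software has never heard of" are mine, for illustration only.

```python
import unicodedata

def describe(payload: bytes):
    """Decode UTF-8 bytes and pair each code point with its name, if known."""
    text = payload.decode("utf-8")
    # unicodedata.name() returns the given default when the local database
    # has no name for the code point.
    return [(ord(ch), unicodedata.name(ch, None)) for ch in text]

# "A" followed by the UTF-8 encoding of U+40000, used here as a stand-in
# for a character unknown to the receiving software.
result = describe(b"A\xf1\x80\x80\x80")
```

Decoding succeeds for both characters. Whether the second one can be
displayed or processed depends on the receiver's tables, and a version number
in the tag would not change that.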

Now the "Korean mess" is an incompatible change, but I called the
consequences "virtual" above. If the UTC agrees with that, the matter is
closed; if it disagrees...

Regards,

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies Inc., Montréal
Tél : +1 (514) 747-2547
Fax : +1 (514) 747-2561

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT