Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Peter_Constable@sil.org
Date: Tue May 29 2001 - 09:20:42 EDT


On 05/27/2001 08:03:37 PM Jianping Yang wrote:

>>But it seems to me that we've lived without
>>Premise B in the past, and that it won't benefit us to adopt it now. Why
>>bother with it? Why not continue doing what we already know how to do?

>As a matter of fact, the surrogate or supplementary character was not
>defined in the past, so we could live without Premise B in the past. But
>now the supplementary character is defined and will soon be supported, we
>have to bother with it.

As others have mentioned, supplementary characters weren't conceived of
yesterday; UTF-16 has been formally part of the Unicode standard since July
1996, and was in the works prior to that.

Also, reviewing Premise B:

<quote>
Premise B (rev'd): Encoding forms maintained by UTC should all yield a
common binary order.
</quote>

Premise B has not been valid since it first made sense to talk about it in
1996. There is no logical connection between Premise B and the fact that
assigned supplementary-plane characters were formally introduced in TUS3.1.
So your statement "we could live without Premise B in the past [but] ...
now we have to bother with it" doesn't make sense. What you should have
said was, "We could get away with ignoring the binary sorting difference
between UTF-8 and UTF-16 in the past, but now we can't." That is a valid
statement. It does not, however, logically imply any particular solution.
It most certainly does not imply that Premise B must be made true.
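
To make the binary sorting difference concrete, here is a minimal sketch in
Python. The two sample characters and the helper names are chosen only for
illustration; any BMP character at or above U+E000 paired with any
supplementary-plane character would show the same effect.

```python
# Compare binary order of the same two strings in UTF-8 and UTF-16.
# Sample characters are illustrative assumptions, not from the thread.

bmp = "\uFF5E"       # U+FF5E, a BMP character above the surrogate range
supp = "\U00010000"  # U+10000, the first supplementary-plane character


def utf8_key(s):
    # Byte-wise comparison of the UTF-8 encoded form.
    return s.encode("utf-8")


def utf16_key(s):
    # Big-endian bytes compare the same way as 16-bit code units.
    return s.encode("utf-16-be")


by_utf8 = sorted([supp, bmp], key=utf8_key)
by_utf16 = sorted([bmp, supp], key=utf16_key)

# UTF-8:  EF BD 9E  <  F0 90 80 80   so U+FF5E sorts first
# UTF-16: D800 DC00 <  FF5E          so U+10000 sorts first
print([hex(ord(c)) for c in by_utf8])
print([hex(ord(c)) for c in by_utf16])
```

UTF-8 binary order agrees with code point order; UTF-16 binary order does
not, because surrogate code units (D800..DFFF) fall below E000..FFFF.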

Regardless of whether UTF-16 and supplementary-plane characters became an
issue for us years ago or just yesterday, your proposal hinges on the
assumption in Premise B. In my earlier message, I made the case that

- Premise B cannot strictly be made true without taking impossibly drastic
measures

- short of that, your proposed solution in pursuit of a less strict
interpretation of Premise B does not solve the problem you are facing since
in open systems you still have to support UTF-8, whether new encoding forms
UTF-8S/32S were made normative parts of the Standard or not

- if you are only concerned about your own closed systems, and open systems
are not a problem for you, then you can do what you like without UTC taking
any action

- any proposal that attempts to make Premise B valid, especially in a
strict interpretation but also with a less strict interpretation, will be
very costly for everyone, far more costly than dispensing with Premise B,
letting people know it's invalid, and each one fixing their UCS-2 code to
work with UTF-16.
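
As a rough illustration of what fixing code to work with UTF-16 can look
like, the sketch below compares UTF-16 strings in code point order by
remapping code units before comparison. This is one commonly described
technique, offered under the assumption that code point order is the order
wanted; the function names are hypothetical and it is not a solution
anyone in this thread has proposed.

```python
# Sort UTF-16 data in code point order by adjusting code units on the fly.
# Function names are hypothetical; the remapping is an assumed technique.


def utf16_units(s):
    # Split a string into its 16-bit UTF-16 code units.
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    # Move E000..FFFF down below the surrogates, and surrogates to the top.
    if unit >= 0xE000:
        return unit - 0x800    # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # D800..DFFF -> F800..FFFF
    return unit


def raw_utf16_key(s):
    # Plain binary order of UTF-16 code units.
    return utf16_units(s)


def code_point_order_key(s):
    # UTF-16 code units remapped so that comparison follows code point order.
    return [fixup(u) for u in utf16_units(s)]


samples = ["\U00010000", "\uFF5E", "A", "\uE000"]
fixed = sorted(samples, key=code_point_order_key)
raw = sorted(samples, key=raw_utf16_key)
print([hex(ord(c)) for c in fixed])
print([hex(ord(c)) for c in raw])
```

With the remapping, the UTF-16 comparison gives the same order as a binary
comparison of the UTF-8 form, without changing either encoding.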

If you really want to push your proposal, then I think you really need to
make a case for Premise B, which means that you'll have to speak at least
to the points I've outlined. TUS 3.1 has *nothing* to do with it.

In the meantime, I think several people -- at least Michael Kaplan, and I
believe he is right -- are inclined to think that the real issue is in your
code: you assumed in the past that you could ignore the binary sorting
difference between UTF-8 and UTF-16, but now you can't get away with such
invalid assumptions. You are not the first who has had to clean up their
act.

Regards,
- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


