RE: UTF-8 syntax

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Fri Jun 08 2001 - 14:14:44 EDT


Jianping,

UTF-16 is an encoding system for Unicode. An encoding does not dictate sort
order. It is just an encoding, that is all. If you want to compare two fields,
they should be compared in either the collating sequence for the locale or
Unicode code point order.

If I follow your argument further, we should ensure that EUC-JP, Shift-JIS,
iso-2022-jp and Unicode have the same sort order.

It is not hard to compare UTF-16 data in code point sequence. DO THE RIGHT
THING!!!
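
As a minimal sketch of that comparison, assuming well-formed UTF-16 input: the
usual trick is to remap each code unit so that surrogates sort above
U+E000..U+FFFF before comparing. The helper names utf16_units, fixup and
code_point_key are made up for this illustration and do not come from any
particular product.

```python
def utf16_units(s):
    """Return the UTF-16 code units of a Python string as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def fixup(unit):
    """Remap one code unit so surrogates sort above U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800      # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000     # D800..DFFF -> F800..FFFF
    return unit

def code_point_key(units):
    """Sort key: orders well-formed UTF-16 sequences by code point."""
    return [fixup(u) for u in units]

a = utf16_units("\uff5e")        # U+FF5E, one code unit: FF5E
b = utf16_units("\U00010000")    # U+10000, surrogate pair: D800 DC00

binary_order = a < b                                        # False: FF5E > D800
code_point_order = code_point_key(a) < code_point_key(b)    # True
```

The remapping costs one or two comparisons per code unit, so a code point
ordering over UTF-16 data needs no conversion to another encoding.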

What is really bad about Oracle's proposed UTF-8 implementation is that the
incorrect encoding is called UTF8 and the real UTF-8 is called AL32UTF8. Be
honest with your users. Let them know the real facts. If they use your
UTF8 encoding they can get into trouble. Do it now before users get into a
migration jam.

  UTF8

  The UTF8 character set encodes characters in one to three bytes.
  Surrogate pairs require six bytes.

  AL32UTF8

  The AL32UTF8 character set encodes characters in one to three bytes.
  Surrogate pairs require four bytes.

From this documentation it would seem that UTF8 is the real thing and the
AL32UTF8 is an Oracle special encoding.
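
To make the four-byte versus six-byte difference concrete, here is a small
sketch for U+10000. It uses Python's own codecs, not Oracle's character sets,
and it assumes the six-byte form is produced by encoding each UTF-16 surrogate
as if it were a separate three-byte character.

```python
ch = "\U00010000"  # first supplementary character

# Standard UTF-8: the code point is encoded directly in four bytes.
standard = ch.encode("utf-8")

# Six-byte form: go through UTF-16 first, then encode each surrogate
# code unit on its own ("surrogatepass" lets Python emit lone surrogates).
raw = ch.encode("utf-16-be")
units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
six_byte = "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")

print(standard.hex())   # f0908080
print(six_byte.hex())   # eda080edb080
```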

If you are going to have a non-compliant encoding then you should call it
AL16UTF8 and call the other UTF8. This would be consistent with your
AL16UTF16 encoding selection. You should also change the documentation to:

  UTF8

  The UTF8 character set encodes characters in one to three bytes.
  Surrogate pairs require four bytes.

  AL16UTF8

  The AL16UTF8 character set encodes characters in one to three bytes.
  Surrogate pairs use a non-standard encoding that requires six bytes.
  This encoding provides the same sort order as AL16UTF16 but will not
  work with standard UTF-8 encoders and decoders.
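
Both claims in that wording can be checked with a short sketch. The helper
six_byte_form is hypothetical and only imitates the six-byte scheme with
Python's "surrogatepass" error handler. Python's strict "utf-8" codec stands
in here for a standard decoder.

```python
def six_byte_form(s):
    """Encode s via UTF-16 code units, three bytes per surrogate (illustration only)."""
    raw = s.encode("utf-16-be")
    units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")

bmp = "\uff5e"         # U+FF5E
supp = "\U00010000"    # U+10000

# Standard UTF-8 bytes sort in code point order: U+FF5E before U+10000.
utf8_order = bmp.encode("utf-8") < supp.encode("utf-8")        # True

# The six-byte form sorts like UTF-16 code units: D800 DC00 before FF5E.
six_byte_order = six_byte_form(bmp) < six_byte_form(supp)      # False

# A strict UTF-8 decoder refuses the six-byte sequence.
try:
    six_byte_form(supp).decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```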

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Jianping Yang
Sent: Thursday, June 07, 2001 6:51 PM
To: Peter_Constable@sil.org
Cc: unicode@unicode.org
Subject: Re: UTF-8 syntax

I don't get point from this argument as UTF-8S is exactly mapped to UTF-16 in
UTF-16 code unit which means one UTF-16 code unit will be mapped to either
one, two, or three bytes in UTF-8S. So if you are saying there is ambiguous
in UTF-8S, it should also apply to UTF-16, which does not make sense to me.

Regards,
Jianping.

Peter_Constable@sil.org wrote:

> On 06/07/2001 10:38:15 AM DougEwell2 wrote:
>
> >The ambiguity comes from the fact that, if I am using UTF-8s and I want
> >to represent the sequence of (invalid) scalar values <D800 DC00>, I must
> >use the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent
> >the (valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely
> >good with tarot cards, you have no way, upon reverse-mapping the UTF-8s
> >sequence <ED A0 80 ED B0 80>, to know whether it is supposed to be mapped
> >back to <D800 DC00> or to <10000>.
>
> This brings out a good point. We can't yet say that UTF-8s is ambiguous
> since it is not formally defined. What this does highlight, though, is a
> gap in the proposal that must be addressed before it could be considered:
> a well-formed definition for UTF-8 must (by D29) provide a *unique*
> representation for *all* USVs, and unless the proposal is amended to
> remove D800 - DFFF from the codespace, it must be amended to provide some
> unique means of representing things like U+D800. What it is *not allowed*
> to be is ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to mean
> U+10000, then it must provide some sequence other than <ED A0 80> to mean
> U+D800.
>
> >Premise: Unicode should not, and does not, define ambiguous UTFs.
> > I think we agree on this.
>
> Yes.
>
> >Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> > I tried to prove this above.
> >
> >Conclusion: Unicode should not define UTF-8s.
>
> I definitely agree with the idea you're getting at, but am just looking
> from a very slightly different angle. The conclusion does not necessarily
> follow because UTF-8s is only a proposal that potentially can be modified.
> If you say, "UTF-8s as has been currently proposed would be inconsistent
> with D29", then I agree. The proposed definition for UTF-8s *could*
> potentially be revised, though, and so the argument that a UTF-8s cannot
> be added to Unicode doesn't hold.
>
> UTF-8s definitely is not tenable as currently proposed, given the current
> definitions. I think we agree on that.
>
> - Peter
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
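
The ambiguity described in the quoted message can be reproduced in a few
lines. This is only a sketch: Python's "surrogatepass" handler stands in for a
decoder that accepts three-byte surrogate sequences, and the variable names
are made up for the illustration.

```python
data = bytes.fromhex("eda080edb080")   # the sequence <ED A0 80 ED B0 80>

# Reading 1: two separate surrogate code points, <D800 DC00>.
as_surrogates = data.decode("utf-8", "surrogatepass")

# Reading 2: pair the two surrogates up as UTF-16 would, giving <10000>.
as_supplementary = (
    as_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
)

print([hex(ord(c)) for c in as_surrogates])      # ['0xd800', '0xdc00']
print([hex(ord(c)) for c in as_supplementary])   # ['0x10000']
```

Both readings start from the same six bytes, and nothing in the byte sequence
says which one was intended.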


