Re: UTF-8 syntax

From: Jianping Yang (Jianping.Yang@oracle.com)
Date: Fri Jun 08 2001 - 14:33:16 EDT


Carl,

Please stay focused on the subject we have been discussing in this email chain.
If you cannot understand the issue in this chain, just don't waste your time here.

As for Oracle's naming convention, we will consider your concern. But given
the clear definition and recommendation, I don't think users will be confused.

Regards,
Jianping.

"Carl W. Brown" wrote:

> Jianping,
>
> UTF-16 is an encoding system for Unicode. Encoding does not indicate sort
> order. It is just encoding, that is all. If you want to compare two fields,
> they should be compared in either the collating sequence for the locale or
> Unicode code point order.
>
> If I follow your argument further we should ensure that EUC-J, Shift-JIS,
> iso-2022-jp and Unicode have the same sort order.
>
> It is not hard to compare UTF-16 data in code point sequence. DO THE RIGHT
> THING!!!
>
> What is really bad about Oracle's proposed UTF-8 implementation is that the
> incorrect encoding is called UTF8 and the real UTF-8 is called AL32UTF8. Be
> honest with your users. Let them know the real facts. If they use your
> UTF8 encoding they can get into trouble. Do it now before users get into a
> migration jam.
>
> UTF8
>
> The UTF8 character set encodes characters in one to three bytes. Surrogate
> pairs require six bytes.
>
> AL32UTF8
>
> The AL32UTF8 character set encodes characters in one to three bytes.
> Surrogate pairs require four bytes.
>
> From this documentation it would seem that UTF8 is the real thing and the
> AL32UTF8 is an Oracle special encoding.
>
> If you are going to have a non-compliant encoding then you should call it
> AL16UTF8 and call the other UTF8. This would be consistent with your
> AL16UTF16 encoding selection. You should also change the documentation to:
>
> UTF8
>
> The UTF8 character set encodes characters in one to three bytes. Surrogate
> pairs require four bytes.
>
> AL16UTF8
>
> The AL16UTF8 character set encodes characters in one to three bytes.
> Surrogate pairs use a non-standard encoding that requires six bytes. This
> encoding provides the same sort order as AL16UTF16 but will not work with
> standard UTF-8 encoders and decoders.
>
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Jianping Yang
> Sent: Thursday, June 07, 2001 6:51 PM
> To: Peter_Constable@sil.org
> Cc: unicode@unicode.org
> Subject: Re: UTF-8 syntax
>
> I don't get the point of this argument, as UTF-8S maps exactly to UTF-16 at
> the level of the UTF-16 code unit, which means each UTF-16 code unit is
> mapped to one, two, or three bytes in UTF-8S. So if you are saying there is
> ambiguity in UTF-8S, it should also apply to UTF-16, which does not make
> sense to me.
>
> Regards,
> Jianping.
>
> Peter_Constable@sil.org wrote:
>
> > On 06/07/2001 10:38:15 AM DougEwell2 wrote:
> >
> > >The ambiguity comes from the fact that, if I am using UTF-8s and I want
> > >to represent the sequence of (invalid) scalar values <D800 DC00>, I must
> > >use the UTF-8s sequence <ED A0 80 ED B0 80>, and if I want to represent
> > >the (valid) scalar value <10000>, I must *also* use the UTF-8s sequence
> > ><ED A0 80 ED B0 80>. Unless you have a crystal ball or are extremely good
> > >with tarot cards, you have no way, upon reverse-mapping the UTF-8s
> > >sequence <ED A0 80 ED B0 80>, to know whether it is supposed to be mapped
> > >back to <D800 DC00> or to <10000>.
> >
> > This brings out a good point. We can't yet say that UTF-8s is ambiguous
> > since it is not formally defined. What this does highlight, though, is a
> > gap in the proposal that must be addressed before it could be considered:
> > a well-formed definition for UTF-8 must (by D29) provide a *unique*
> > representation for *all* USVs, and unless the proposal is amended to
> > remove D800 - DFFF from the codespace, it must be amended to provide some
> > unique means of representing things like U+D800. What it is *not allowed*
> > to be is ambiguous. If UTF-8s considers <ED A0 80 ED B0 80> to mean
> > U+10000, then it must provide some sequence other than <ED A0 80> to mean
> > U+D800.
> >
> > >Premise: Unicode should not, and does not, define ambiguous UTFs.
> > > I think we agree on this.
> >
> > Yes.
> >
> > >Premise: UTF-8s is ambiguous in its handling of surrogate code points.
> > > I tried to prove this above.
> > >
> > >Conclusion: Unicode should not define UTF-8s.
> >
> > I definitely agree with the idea you're getting at, but am just looking
> > from a very slightly different angle. The conclusion does not necessarily
> > follow because UTF-8s is only a proposal that potentially can be modified.
> > If you say, "UTF-8s as has been currently proposed would be inconsistent
> > with D29", then I agree. The proposed definition for UTF-8s *could*
> > potentially be revised, though, and so the argument that a UTF-8s cannot
> > be added to Unicode doesn't hold.
> >
> > UTF-8s definitely is not tenable as currently proposed, given the current
> > definitions. I think we agree on that.
> >
> > - Peter
> >
> > ---------------------------------------------------------------------------
> > Peter Constable
> >
> > Non-Roman Script Initiative, SIL International
> > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > Tel: +1 972 708 7485
> > E-mail: <peter_constable@sil.org>
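
To make the byte sequences discussed in this thread concrete, here is a
minimal Python sketch. It assumes that the six-byte form (called UTF8 in the
Oracle documentation quoted above and UTF-8s in the proposal) simply encodes
each UTF-16 code unit on its own as one to three bytes, as Jianping describes.
The helper name utf8s_encode is hypothetical; it is not an Oracle or Unicode
API, and the sketch is an illustration rather than a definition of UTF-8s.

```python
def utf8s_encode(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit separately as 1-3 bytes.

    Assumption: this is the six-byte surrogate-pair form described in the thread.
    """
    # 'surrogatepass' lets lone surrogates such as U+D800 through unchanged.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)


supplementary = "\U00010000"      # valid scalar value U+10000
surrogates = "\ud800\udc00"       # the two surrogate code points D800 DC00
bmp_high = "\uff5e"               # a BMP character above the surrogate range

# Standard UTF-8 uses four bytes for U+10000; the per-code-unit form uses six.
print(supplementary.encode("utf-8").hex(" "))   # f0 90 80 80
print(utf8s_encode(supplementary).hex(" "))     # ed a0 80 ed b0 80

# The ambiguity Doug Ewell points out: both inputs yield the same six bytes.
print(utf8s_encode(surrogates) == utf8s_encode(supplementary))   # True

# Binary sort order: standard UTF-8 follows code point order, while the
# per-code-unit form follows UTF-16 code unit order.
pair = [bmp_high, supplementary]
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_utf8s = sorted(pair, key=utf8s_encode)
print(by_utf8 == [bmp_high, supplementary])     # True
print(by_utf8s == [supplementary, bmp_high])    # True
```

Under these assumptions the sketch shows both sides of the argument: the
six-byte form does reproduce the binary ordering of UTF-16, and it also maps
<D800 DC00> and <10000> to the same byte sequence, which is the conflict
with D29 raised above.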





This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT