Fw: AW: Fwd: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Mark Davis (mark@macchiato.com)
Date: Wed Jun 13 2001 - 11:10:16 EDT


----- Original Message -----
From: "Mark Davis" <markdavis34@home.com>
To: <Peter_Constable@sil.org>; <unicore@unicode.org>
Sent: Wednesday, June 06, 2001 07:51
Subject: Re: AW: Fwd: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> I agree with Peter that Mori's argument on this point doesn't hold water.
> We can conceive of two order-consistent groups of UTFs:
>
> UTF-8
> UTF-16x
> UTF-32
>
> UTF-8s
> UTF-16
> UTF-32s
>
> [UTF-16x is one that is rearranged as described in many previous emails,
> so that the surrogate blocks are at the top.]
>
> Within each group of three, order is preserved. Within all 6 forms, all
> code points are preserved (except a sequence of high/low surrogates).*
>
> We absolutely *cannot* change the definition of UTF-8 to be UTF-8s. That
> has never been in question! Neither is it possible to *change* the
> definition of UTF-16 to be UTF-16x. No possibility at all. Not worth
> discussing.
>
> The most we are ever discussing is whether to *add* one or more of UTF-8s
> and UTF-32s, as *distinct* UTFs, with distinct names.
>
> Mark
>
> * Note: there is a caveat to code point preservation. Although the
> discussion on p46 says that you must preserve unpaired surrogates and
> noncharacters, the amendment to C10 in TR27 says that you are not required
> to preserve noncharacters. (You could also replace sequences by
> canonically-equivalent sequences.)
>
> ----- Original Message -----
> From: <Peter_Constable@sil.org>
> To: <unicore@unicode.org>
> Sent: Wednesday, June 06, 2001 06:32
> Subject: Re: AW: Fwd: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
>
>
> >
> > On 06/06/2001 03:39:33 AM Mori, Nobuyoshi wrote:
> >
> > >The message I wished Unicode experts to consider was:
> > >
> > >Unicode provides a unique number for every character. We have 3
> > >encoding schemes: UTF-8, UTF-16 and UTF-32. The 3 UTF-n variations
> > >are merely mathematical transformations of Unicode and should be
> > >compatible with each other. Both the amount and ***order*** of
> > >Unicode characters are part of the nature of Unicode and must be
> > >preserved during UTF-n conversions.
> > >The UTF-8s/32s suggestion is a serious attempt to fix the broken
> > >order among the various UTF-n's.
> >
> > [emphasis added]
> >
> > There is an assumption that order must be preserved across encoding
> > forms.
> > This is "Premise B" in the following message which I posted to the
> > other list last week and am copying for the benefit of those that may
> > not have seen it. I have already shown that it is impossible to make
> > Premise B hold without taking drastic measures, and feel I have given
> > good arguments as to why attempts to uphold Premise B do not accomplish
> > what they purport to do. There has not yet been any response to those
> > arguments, or to several others presented since on unicore, including
> > in this recent post. I do not see how a case for UTF-8s/-32s can
> > logically be made unless the argument I presented is shown to be
> > invalid.
> >
> >
> > >I read a lot of emotional and strong words against the proposal.
> >
> > My words in the following post are not particularly emotional. They
> > are basically a matter of logical reasoning (at least, in the Western
> > tradition of logic). They are strong insofar as the argument presented
> > is, I think, strong.
> >
> >
> > - Peter
> >
> >
> >
> >
> >
> > -------------------------------
> >
> > >If you think something abominable is happening, please raise a loud
> > >voice and flood UTC members with e-mail and tell everyone what you
> > >think and why you think it. Nobody can hear you when you mumble.
> > >
> > >And it helps if you have solid technical and philosophical arguments
> > >to convey.
> >
> > Well, I wasn't going to elaborate (just been through this elsewhere)...
> >
> > The Unicode flavour of UTF-8 only allows for sequences of up to four
> > code units in length to represent a Unicode character, in contrast to
> > ISO's six, the difference having to do with Unicode having limited the
> > codespace to U+10FFFF (whereas ISO 10646 formally includes a codespace
> > up to U+7FFFFFFF, but will be effectively restricting use to U+10FFFF).
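The four-versus-six-byte difference follows directly from the UTF-8 bit patterns; a short sketch makes it concrete. (`iso_utf8_len` is an illustrative name, not anything from the standards; the thresholds are the standard UTF-8 lead-byte boundaries extended to ISO's original six-byte form.)

```python
def iso_utf8_len(cp: int) -> int:
    """Sequence length for a code point under the original ISO 10646
    UTF-8 definition, which extends the same bit patterns up to
    U+7FFFFFFF (six bytes). Illustrative helper, not a standard API."""
    for length, limit in enumerate(
            (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000),
            start=1):
        if cp < limit:
            return length
    raise ValueError("code point out of range")

print(iso_utf8_len(0x10FFFF))    # 4 -- the most Unicode's UTF-8 ever needs
print(iso_utf8_len(0x7FFFFFFF))  # 6 -- ISO's formal maximum
```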
> >
> > BUT...
> >
> > >There was another abomination proposed. Oracle, rather than adding
> > >UTF-16 support, proposed that non-plane-0 characters be encoded to
> > >and from UTF-8 by encoding each of the surrogates into a separate
> > >UTF-8 character.
> >
> > Yes, Oracle, PeopleSoft and SAP submitted a proposal to UTC to sanction
> > another encoding form, UTF-8S, that would encode supplementary-plane
> > characters as six bytes, three corresponding to each of a UTF-16 high
> > and low surrogate. The rationale had to do with having an 8-bit
> > encoding form that would "binary" sort in the same way as UTF-16.
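For concreteness, here is a sketch of that six-byte scheme, which is essentially what was later documented as CESU-8: the supplementary character is first split into UTF-16 surrogates, and each surrogate is then run through the ordinary three-byte UTF-8 pattern. (`utf8s_encode` is a made-up name, not from the proposal.)

```python
def utf8s_encode(ch: str) -> bytes:
    """Encode one character the UTF-8S way: BMP characters as in
    ordinary UTF-8, supplementary-plane characters via UTF-16
    surrogates, each surrogate as a three-byte sequence."""
    cp = ord(ch)
    if cp < 0x10000:
        return ch.encode("utf-8")      # BMP: identical to UTF-8
    v = cp - 0x10000                   # split into high/low surrogates
    high = 0xD800 + (v >> 10)
    low = 0xDC00 + (v & 0x3FF)

    def three_bytes(u: int) -> bytes:  # standard 3-byte UTF-8 pattern
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])

    return three_bytes(high) + three_bytes(low)

# U+10400 DESERET CAPITAL LETTER LONG I:
print(len("\U00010400".encode("utf-8")))  # 4 bytes in ordinary UTF-8
print(len(utf8s_encode("\U00010400")))    # 6 bytes in UTF-8S
```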
> >
> > (Warning: This gets a bit long. I'm doing this because I was advised
> > not to mumble but to speak up. :-)
> >
> >
> > The issue is this: Unicode's three encoding forms don't sort in the
> > same way when sorting is done using that most basic and
> > valid-in-almost-no-locales-but-easy-and-quick approach of simply
> > comparing binary values of code units. The three give these results:
> >
> > UTF-8: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
> > UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> > UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
> >
> > This is seen by the proposers to be a problem: if you have data from
> > one source in UTF-8 and another in UTF-16 and you sort the two, you'd
> > like to be able to compare results from each source and know that
> > you're sorting things that are comparable. By using a UTF-8 variation
> > ("UTF-8S", in which supplementary-plane characters are mapped first to
> > UTF-16 surrogates and from there to 8-bit code unit sequences), the
> > resulting ordering is:
> >
> > UTF-8S: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> > UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> > UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
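The two tables can be reproduced in miniature: comparing code units (bytes for UTF-8, 16-bit units for UTF-16, code points for UTF-32) orders a high BMP character and a supplementary-plane character differently. The sample characters below are arbitrary choices; comparing big-endian byte strings is equivalent to comparing code units.

```python
bmp = "\uFB01"       # U+FB01, a BMP character above the surrogate range
supp = "\U00010400"  # U+10400, encoded with surrogates in UTF-16

by_utf8 = sorted([bmp, supp], key=lambda s: s.encode("utf-8"))
by_utf16 = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))
by_utf32 = sorted([bmp, supp], key=lambda s: s.encode("utf-32-be"))

print(by_utf8 == by_utf32)   # True: UTF-8 and UTF-32 agree
print(by_utf8 == by_utf16)   # False: UTF-16 is the odd one out
```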
> >
> > The implication would be that we'd have *two* 8-bit encoding forms.
> >
> > A suggestion was made on the unicore list by one of the proponents
> > that all encoding forms should "binary sort" in the same way; I found
> > it surprising that they were proposing a variant of UTF-8 rather than
> > UTF-16, since UTF-16 was the odd one out, and tweaking UTF-8 still
> > leaves one encoding form that "binary sorts" differently: UTF-32.
> > Well, I was aghast when I read the actual proposal to see that, not
> > only are they suggesting that we have a second 8-bit encoding form,
> > "UTF-8S", but they also want to have another 32-bit encoding form,
> > "UTF-32S". So, we'd end up with a total of 5 encoding forms: UTF-8,
> > UTF-8S, UTF-16, UTF-32, UTF-32S.
> >
> > According to the proposal, UTF-8S and UTF-32S would not have the same
> > status: they wouldn't be for interchange; they'd just be for
> > representation internal to a given system, like UTF-EBCDIC (which, I
> > think I heard, has not actually been implemented by IBM in any live
> > systems).
> >
> > What I don't get is this: if you want to implement something just
> > inside your own system and you say you'll make sure nobody else ever
> > sees it, why do you need UTC to sanction it in any way?
> >
> > The crux of the justification offered by Oracle et al is this argument,
> > which appears to me to be fallacious:
> >
> > <quote>
> > Specifically, any system that must deal with the indexing and
> > comparison of large collections of data across multiple encodings will
> > run into the issue that data that is ordered based on the binary values
> > in one encoding will no longer be ordered such once transformed into
> > another encoding. While this lack of binary ordering compatibility
> > across different encodings is very true and well-understood in the
> > world of legacy encodings (such as a transcode of Shift-JIS to EUC-JP),
> > given that all the Unicode Transformation Forms are maintained by a
> > single committee, it should be possible to come up with a common binary
> > order between each of the three main Unicode Transformation Forms.
> > </quote>
> >
> > Summarising:
> >
> > Premise A: the three main Unicode encoding forms are maintained by a
> > single committee.
> > Claimed implication C: it should be possible to come up with a common
> > binary order between each of the three Unicode encoding forms.
> >
> > There is a missing and implied premise that is needed to make the
> > implication work:
> >
> > Premise B: encoding forms maintained by a single committee should all
> > yield a common binary order.
> >
> >
> > This argument seems to me to be faulty in at least two ways:
> >
> > First, it is clearly counterexemplified by existing situations:
> > - e.g. existing Microsoft codepages (the euro binary sorts before the
> > ellipsis in cp1252 but after it in cp1251)
> > - I'm sure it wouldn't be hard to produce counterexamples from the
> > work of JTC1/SC2
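The codepage counterexample is easy to verify, assuming the cp1251/cp1252 codecs that ship with Python:

```python
# Euro sign and horizontal ellipsis in two Microsoft codepages:
# cp1252: euro = 0x80, ellipsis = 0x85 -> euro sorts first
# cp1251: euro = 0x88, ellipsis = 0x85 -> ellipsis sorts first
euro, ellipsis = "\u20AC", "\u2026"
print(euro.encode("cp1252") < ellipsis.encode("cp1252"))  # True
print(euro.encode("cp1251") < ellipsis.encode("cp1251"))  # False
```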
> >
> > Now, I think Oracle et al would offer as a rebuttal that Premise B was
> > not true in the past, but that it should be for Unicode. They offer no
> > real argumentation as to why this should be the case, though. They
> > simply assume that this will be easier for everyone (not at all
> > obvious). I'll revise Premise B to reflect this:
> >
> > Premise B (rev'd): Encoding forms maintained by UTC should all yield a
> > common binary order.
> >
> > This is important, and will come back up later.
> >
> >
> > Secondly, the argument presupposes that the desired result as
> > described in C is (i) possible and (ii) achieved by their proposal.
> > Their proposal does not cause the three main encoding forms to yield a
> > common binary order; what their proposal does is introduce two new
> > encoding forms (and give them a somewhat ambiguous status) that will
> > share a common binary order with UTF-16. The existing encoding forms
> > remain, and continue not to share a common binary order. I think it is
> > self-evident that the desired result can, in fact, only be achieved in
> > one of the following ways:
> >
> > Drastic Measure 1: make UTF-16 obsolete; replace it with UTF-16a,
> > which binary sorts supplementary-plane characters after U+E000..U+FFFF.
> >
> > Drastic Measure 2: make UTF-8 and UTF-32 obsolete; replace them with
> > UTF-8S and UTF-32S.
> >
> > These are, of course, impossible (and fortunately Oracle et al are not
> > proposing either of these).
> >
> >
> > Now, I've argued against a strict interpretation of what they said.
> > Let's consider the spirit of what they're saying: to store data
> > internally using UTF-8S or UTF-32S so that they can compare
> > binary-sort results with data sources encoded using UTF-16. The
> > proposal involves encoding forms with a very ambiguous, quasi-official
> > status:
> >
> > <quote>
> > This paper... proposes to add at least one optional additional UTF, in
> > the form of a Unicode Technical Report (UTR). This form could be
> > implemented by system designers where the benefit of a common binary
> > ordering across different UTFs is important from a performance
> > standpoint, or for other reasons. The new UTF(s) would have equivalent
> > standing as the UTF-EBCDIC transformation currently maintained in
> > UTR#16. It is not proposed that the new transformation form(s) become
> > Standard Annexes (UAX), nor would they be proposed for inclusion in
> > ISO 10646.
> > </quote>
> >
> > For the sake of argument, I'll ignore this for the moment. They offer
> > some usage scenarios; I'll quote an excerpt of only the first (the
> > other adds nothing new to the argument for or against):
> >
> > <quote>
> > UTF-8 database server vs. UTF-16 database client
> >
> > A SQL statement executed on the database server returns a result set
> > ordered by the binary sort of the data in UTF-8, given that this is the
> > encoding of both data and indexes in the database.
> >
> > A C/C++ or Java UTF-16-based client receives this result set and must
> > compare it to a large collection of data stored locally in UTF-16...
> > </quote>
> >
> > This has to assume a closed system in which the server and client are
> > proprietary solutions using proprietary protocols for their
> > interaction. I say this because both are assuming Unicode is always
> > binary sorted in an order that results from UTF-16, and that's a
> > proprietary assumption. To make it otherwise would either require
> > obsoleting UTF-8 and replacing it with UTF-8S, or else would require
> > making UTF-8S an *official Unicode standard* protocol. So, they can't
> > waffle on the status. If they don't want real official standard status
> > for this, then so much for open solutions in which my client can talk
> > to your server, or vice versa.
> >
> > If they want to do this in a closed system, they can already just go
> > ahead and do it; they don't need UTC to give permission for what they
> > do inside their own systems. By proposing that this be documented and
> > given a name, evidently they want to be able to share the assumptions
> > involved with others, i.e. do this in an open context. Thus, even if
> > they don't call it a "standard Unicode encoding form", they're trying
> > to treat it as such. So, it seems to me that this proposal really is
> > asking us to create new, standardised encoding forms that need to be
> > documented as UAXs. Either that, or to adopt Drastic Measure 1 or 2.
> > (I don't think DM1/2 would be considered for a moment by anybody, and
> > Oracle et al explicitly rule that out in their proposal.)
> >
> > So, we're left with them asking us all to adopt a couple of additional
> > standard encoding forms. Do we really want five encoding forms (and
> > eleven encoding schemes)?
> >
> > Even if we go along with this, there's still a problem: that UTF-8 DB
> > server in the scenario above (assuming a non-closed system) might be
> > using true UTF-8, or UTF-8S. (I'm sure there must be existing
> > implementations of clients or servers using UTF-16 and of clients or
> > servers using real UTF-8.) So, the proposal requires not only two new
> > encoding forms; in addition, the following are also necessary:
> >
> > - A way to communicate between client and server what binary sorting
> > assumptions are being made.
> > - Both the client and server *still* need to be able to handle the
> > situation in which one is using UTF-16 and the other is using true
> > UTF-8.
> >
> > So, the proposed solution (following the spirit of the proposal)
> > doesn't eliminate the problem.
> >
> >
> > To summarise:
> >
> > - Oracle et al want UTC to sanction two new encoding forms.
> >
> > - These encoding forms would supposedly have some kind of ambiguous,
> > quasi-official status.
> >
> > - Making the proposal accomplish anything in open systems really
> > requires that the encoding forms have official standard status.
> >
> > - Even so, the proposal does not eliminate the problem that it is
> > supposed to be addressing.
> >
> > - The problem as stated (assuming Premise B) cannot be eliminated in
> > open systems without taking very drastic and impossible measures.
> >
> > - The problem can be solved in closed systems without needing new
> > encoding forms sanctioned by UTC.
> >
> >
> > The whole basis of the problem hinges on Premise B. If we maintain
> > Premise B, then we end up with a situation that can in principle only
> > be solved in closed systems and, as such, doesn't require any new
> > UTC-sanctioned encoding forms (with whatever status). The attempt to
> > solve the problem does not, in fact, eliminate the problem, and gives
> > us new encoding forms to worry about.
> >
> > The alternative is to reject Premise B. That seems to me to be *a
> > whole lot* cleaner and easier.
> >
> > The main point seems to be that Oracle et al want to maintain Premise
> > B, presumably because they think it would be easier. Yet I think I've
> > shown that it isn't, both because it creates new encoding forms to
> > deal with, and because we still have to deal with the reality that the
> > original encoding forms still exist. Now, I'm not a database
> > developer, so I need to be careful, since I can't presume to know the
> > particular implementation needs of such environments. But it seems to
> > me that we've lived without Premise B in the past, and that it won't
> > benefit us to adopt it now. Why bother with it? Why not continue doing
> > what we already know how to do?
> >
> > The only possible answer I can think of is out of concern that, in the
> > case of Unicode, some implementers may *assume* Premise B to be true.
> > Our options, therefore, are twofold: to make Premise B in fact true --
> > but we've seen that that's the harder road and doesn't benefit us
> > after all -- or to make people understand that Premise B is false.
> > People already need to learn about Unicode in order to implement it;
> > why can't they also learn that Premise B is false? (This seems too
> > easy; I must be missing something.)
> >
> >
> >
> > - Peter
> >
> >
>
> > ---------------------------------------------------------------------------
> > Peter Constable
> >
> > Non-Roman Script Initiative, SIL International
> > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > Tel: +1 972 708 7485
> > E-mail: <peter_constable@sil.org>
> >
> >
> >
> >
> >
>
>



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT