Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Michael (michka) Kaplan (
Date: Mon May 28 2001 - 20:29:00 EDT

From: "Jianping Yang" <>

> As a matter of fact, the surrogate or supplementary
> character was not defined in the past, so we could
> live without Premise B in the past. But now the
> supplementary character is defined and will soon be
> supported, we have to bother with it.

Poor planning on a member company's part does not constitute a work item on
the UTC's part. With seven years, Oracle and PeopleSoft and everyone else
clearly could have planned here. That they did not is unfortunate, but that
they want to publicize their poor planning seems like really bad marketing
(who wants to admit to being that wrong?).

So let's look at life if nothing is done by the UTC: when Oracle (for
example) emits UTF-8 data, the spec clearly allows others to accept this
data. They can therefore accept the illegal, unofficial UTF-8 that they spit
out (or belch out, or whatever). And when nice processes that were reading
the spec and thinking ahead do the right thing and then pass the information
to Oracle, it SHOULD accept the information; if it does not, then that is a
bug on their part that should be FIXED. Seems like everything will work and
they can still keep their silly illegal, unofficial encoding.

As a good example of a company that could have taken this tack, let's look at
Microsoft. Their Windows 2000/Office 2000 and all related products of that
generation had a story for UTF-8 that quite simply *sucked*; there is no
better way to put it. They allowed non-shortest form UTF-8 even for web
servers and probably did more to undermine confidence in UTF-8 than almost
any other company since it was so easy to show the security holes and buffer
overrun problems. This was clearly a case where MS sucked, and their
handling of surrogates was not much better as there were all sorts of bugs.

So what did they do.... did they come screaming to the UTC to add a new
UTF-8 form that would make their actions legal? No, they bit the bullet and
made changes in over two dozen products for the next generation of their
software (Whistler/Office XP/et al.) to fix up these issues. All of their
non-shortest form and 6-byte supplementary character situations were
considered bugs to be fixed. A lot of moaning and a bunch of work to track
down all the instances, but they did it because they can recognize when
*they* messed up and that it is their job to clean up their own mess.
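The "6-byte supplementary character" bug mentioned above comes from UTF-8-encoding
each half of a UTF-16 surrogate pair separately instead of the code point itself.
A sketch of my own (in Python, not from the original post):

```python
# A supplementary character such as U+10000 has one legal UTF-8 form: 4 bytes.
legal = "\U00010000".encode("utf-8")
assert legal == b"\xf0\x90\x80\x80"

# The buggy "6-byte" form: treat the UTF-16 surrogate pair (D800, DC00) as two
# ordinary code points and UTF-8-encode each half as a 3-byte sequence.
# (Python's "surrogatepass" handler is used here only to construct the bad bytes.)
hi, lo = 0xD800, 0xDC00
illegal = (chr(hi) + chr(lo)).encode("utf-8", "surrogatepass")
assert illegal == b"\xed\xa0\x80\xed\xb0\x80"

# Strict UTF-8 rejects surrogate code points entirely.
try:
    illegal.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```

The 6-byte form is precisely what the UTF-8S proposal would have blessed, and
what Microsoft instead classified as bugs to be fixed.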

So whose job precisely is it to scramble now?

(A) The Unicode Consortium itself -- the standard
(B) The companies that have been doing it all correctly
(C) The companies who have been messing it up

Clearly, (A) is an incredibly selfish idea -- asking others to live with the
bad implementation decisions of one's own company -- in this case limiting
the definition of "bad" to being "not forward-looking". (B) is not even
worth considering since they are the group who least deserves to do any
extra work, at all.

The problem lies with group (C) and that is where the responsibility to fix
it up should lie. If they have to take a slight performance hit to do their
binary sorts then I think that's a mild price to pay, given that they had seven
years to plan and they decided to wait until their term paper was due before
they would scream to the instructor for an extension. PeopleSoft and Oracle
and all of the other UTF-8S fans should count themselves lucky, because when
*I* was in school, the punishment would have been a helluva lot worse.
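For context, the "performance hit" in question exists because raw byte order
differs between UTF-8 and UTF-16 for supplementary characters. A small sketch
of my own (Python, not from the original post) shows the flip:

```python
# A BMP character near the top of the plane vs a supplementary character.
bmp, supp = "\uff5e", "\U00010000"

# In UTF-8, binary byte order matches code point order: U+FF5E < U+10000.
assert bmp.encode("utf-8") < supp.encode("utf-8")          # EF BD 9E < F0 90 80 80

# In UTF-16, U+10000 is stored as the surrogate pair D800 DC00, and a 0xD8
# lead byte sorts BELOW 0xFF, so the binary order is inverted.
assert supp.encode("utf-16-be") < bmp.encode("utf-16-be")  # D8 00 DC 00 < FF 5E
```

Vendors who binary-sort UTF-16 data by raw bytes must therefore remap during
comparison to agree with true UTF-8/code-point order, which is the mild cost
being argued about.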


Michael Kaplan
Trigeminal Software, Inc.

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT