RE: UTF-8S ???

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Jun 13 2001 - 06:35:42 EDT


Markus Kuhn wrote:
> What is this UTF-8S please? The term showed up on
> linux-utf8@nl.linux.org in
> http://mail.nl.linux.org/linux-utf8/2001-06/msg00037.html

I am the guilty one. Sorry for the inconvenience, but I did it precisely so
that someone on the Linux UTF-8 list would ask the question you just asked.

Short answers are all four letters long, so I'll try a longer one.

UTF-8S is a proposal by Oracle, PeopleSoft, et al. to define a new UTF. It
looks like UTF-8, but characters above U+FFFF are represented with two
3-byte sequences, one for each of the two UTF-16 surrogate code units.

The reason for this proposal, according to Oracle and PeopleSoft, is to
allow UTF-8 databases to have a binary sort order identical to that of
UTF-16 databases.
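
To make the sorting issue concrete, here is a small Python sketch of my own
(not taken from any proposal text). It compares the bytewise order of two
characters under the standard UTF-8 and big-endian UTF-16 codecs. The sample
characters and the helper name binary_order are arbitrary choices.

```python
# Illustration only: compare the binary (bytewise) sort order of the same
# strings under standard UTF-8 and big-endian UTF-16.
samples = ["\uFF21", "\U00010000"]  # U+FF21 (BMP) and U+10000 (supplementary)

def binary_order(strings, codec):
    """Return the strings sorted by their encoded bytes in the given codec."""
    return sorted(strings, key=lambda s: s.encode(codec))

utf8_order = binary_order(samples, "utf-8")
utf16_order = binary_order(samples, "utf-16-be")

# UTF-8:  U+FF21 -> EF BC A1,  U+10000 -> F0 90 80 80  (code point order)
# UTF-16: U+FF21 -> FF21,      U+10000 -> D800 DC00    (surrogates sort first)
print([f"U+{ord(s):04X}" for s in utf8_order])
print([f"U+{ord(s):04X}" for s in utf16_order])
```

The two lists come out in opposite order, which is the mismatch the
proposal is trying to remove.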

So, e.g., U+10000 in UTF-8S is <ED A0 80 ED B0 80>, which is a UTF-8-like
re-encoding of the UTF-16 sequence <D800 DC00>.
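
As a rough sketch of that scheme, assuming I have understood it correctly
(there is no public specification to check against), an encoder could look
like the following Python. The function name utf8s_encode is hypothetical.

```python
def utf8s_encode(text):
    """Encode text as UTF-8S is described in this message (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the code point into its two UTF-16 surrogate code units.
            cp -= 0x10000
            units = [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
            # Encode each surrogate with the ordinary 3-byte UTF-8 bit layout.
            for u in units:
                out += bytes([0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)])
        else:
            # Characters up to U+FFFF are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)

encoded = utf8s_encode("\U00010000")
print(" ".join(f"{b:02X}" for b in encoded))  # ED A0 80 ED B0 80
```

With this encoding, U+10000 sorts bytewise before U+FF21 (ED... < EF...),
the same relative order that UTF-16 code units give.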

Lots of people on the Unicode List are arguing that this new UTF-8S will
soon be confused with "genuine" UTF-8, and that it will cause a lot of
problems for everybody.

In particular, UTF-8 to UTF-32 converters (which are at the core of, e.g.,
mbrtowc) may soon be presented with "irregular" UTF-8 data, which is
actually UTF-8S mislabeled.
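
One way a converter could at least notice such data is sketched below in
Python. This is only a heuristic of my own: the names looks_like_utf8s and
classify are hypothetical, and the behaviour relied on is that of Python 3's
strict "utf-8" codec, which rejects encoded surrogates. Other decoders may
accept them silently.

```python
import re

# A 3-byte sequence encoding a high surrogate (D800..DBFF) followed by a
# 3-byte sequence encoding a low surrogate (DC00..DFFF).
SURROGATE_PAIR = re.compile(
    rb"\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]"
)

def looks_like_utf8s(data):
    """Heuristic: True if data contains a surrogate pair encoded UTF-8S style."""
    return SURROGATE_PAIR.search(data) is not None

def classify(data):
    """Label bytes as 'utf-8', 'utf-8s?' (probably mislabeled), or 'invalid'."""
    try:
        data.decode("utf-8")  # strict decoding; rejects encoded surrogates
        return "utf-8"
    except UnicodeDecodeError:
        return "utf-8s?" if looks_like_utf8s(data) else "invalid"

print(classify(bytes.fromhex("EDA080EDB080")))   # UTF-8S form of U+10000
print(classify("\U00010000".encode("utf-8")))    # genuine UTF-8
```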

To help this confusion along, the Oracle database labels UTF-8S as "UTF8",
while genuine UTF-8 is labeled with the weird acronym "AL32UTF8".

> without any link to a proposal document. Google and Altavista
> don't know the term either.

I wish I could see such a document myself. Ken Whistler said that trying to
find out what this proposal effectively proposes is like "pulling teeth" out
of Oracle people's mouths.

> The unicode@unicode.org archive on
>
> ftp://ftp.unicode.org/Public/MailArchive/
>
> is utterly useless, [...]

This is better:

        http://groups.yahoo.com/group/unicode/messages

Warning: it is JUST an archive! Don't post there.

The thread has been renamed several times; look for subjects containing
"UTF-8s", "UTF-8 syntax", and "AL32UTF8".

> Please do not forget to *always* include the original document URL or
> similar introductory information to cross-posts to other lists! Cut &
> paste of URLs really isn't that difficult, so make a habit of it,
> please!

I wish there were some URL to cut & paste. I am afraid that you'll have to
follow the issue as it comes.

> So is this UTF-8S something useful, or just yet another
> political-correctness exercise like UTF-32 was?

Useful? The reason I cross-posted to the Linux list was to warn you that,
if accepted, UTF-8S could potentially undermine all the effort being made
to Unicodicize Linux!

_ Marco


