RE: A modest proposal for UTF-8s

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Wed Jun 13 2001 - 14:33:51 EDT


> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Carl W. Brown
> Sent: Wednesday, June 13, 2001 9:40 AM
> To: unicode
> Subject: A modest proposal for UTF-8s
>
>
> What do you really want? A UTF-8 like encoding with UTF-16 sorting.
>
> The major problem with UTF-8s as it is currently proposed is that it
> violates the basic tenet of all UTF encodings that you can determine
> character length from the first part of the encoding. This is true of
> most character sets. With the current UTF-8s proposal characters starting
> with ED can be either 3 or 6 bytes long. Encodings like iso-2022 are
> almost worthless for anything other than data transport. If you want to
> actually manipulate the data, you have to transform it to a better
> encoding. Even strncpy-type functions should only copy complete
> characters. There is very little you can do with data without accurate
> character boundary information.
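
The length rule described above can be sketched roughly as follows. This is only an illustration, not code from the proposal: the function names are hypothetical, and the UTF-8s rule assumes that a supplementary character is stored as two surrogates, each encoded as a 3-byte sequence.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of a well-formed UTF-8 sequence, decided by the lead byte alone."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx
    if 0xC2 <= lead <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                      # 11110xxx
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")


def utf8s_sequence_length(data: bytes, i: int = 0) -> int:
    """Character length under the assumed UTF-8s rule.

    A lead byte of ED is ambiguous: the second byte must be inspected to
    tell a 3-byte BMP character from a 6-byte encoded surrogate pair.
    """
    lead = data[i]
    if lead == 0xED and i + 1 < len(data) and 0xA0 <= data[i + 1] <= 0xAF:
        return 6                      # ED A0..AF starts an encoded high surrogate
    return utf8_sequence_length(lead)


bmp_char = bytes([0xED, 0x9F, 0xBF])                     # U+D7FF
pair = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])       # D800 DC00, i.e. U+10000

print(utf8s_sequence_length(bmp_char), utf8s_sequence_length(pair))
```

Both inputs start with ED, yet one character is 3 bytes and the other is 6, which is the ambiguity in question.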
>
> If you have a UTF-8s you will also need a UTF-32s. Most of the newer
> wchar_t implementations are using 4-byte wide characters. So it makes
> more sense to design the UTF-32s first. If we make the assumption that
> UTF-32s will take all code points above U+DFFF and shift them after
> plane 16 then we have a way to encode them as a single UTF character.
>
> UTF-16    UTF-32s     UTF-8s
>
> E000      00110000    F4908080
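
One possible reading of this mapping, as a sketch: BMP code points U+E000..U+FFFF are moved up by 0x102000 so that they land just after plane 16, and the result is then encoded with the ordinary UTF-8 bit pattern. The function names and the exact offset are assumptions inferred from the single example row above, not a specification.

```python
def to_utf32s(cp: int) -> int:
    """Map a Unicode scalar value to the assumed UTF-32s value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if 0xE000 <= cp <= 0xFFFF:
        return cp + 0x102000          # U+E000 -> 0x110000, just past plane 16
    return cp


def encode_utf8_pattern(value: int) -> bytes:
    """Encode an integer below 0x200000 with the UTF-8 bit layout."""
    if value < 0x80:
        return bytes([value])
    if value < 0x800:
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    if value < 0x10000:
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])


def encode_utf8s(cp: int) -> bytes:
    """Encode one code point in the single-character UTF-8s sketched here."""
    return encode_utf8_pattern(to_utf32s(cp))


# Code points listed in UTF-16 code unit order.
samples = [0x41, 0xD7FF, 0x10000, 0x10FFFF, 0xE000, 0xFFFF]
by_utf16 = sorted(samples, key=lambda cp: chr(cp).encode("utf-16-be"))
by_utf8s = sorted(samples, key=encode_utf8s)

print(encode_utf8s(0xE000).hex().upper())   # F4908080
print(by_utf16 == by_utf8s)
```

Under these assumptions every character has a length fixed by its lead byte, and binary order of the encoded bytes matches binary order of the UTF-16 code units.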
>
> When use UTF-8s? If the goal is the have the UTF-8s be used
> internally in a
> product like Oracle so that it sorts the same as UTF-16 but that the data
> will actually be retrieved in UTF-16 sequence there is no issue
> because the
> encoding will only be used internally. If the UTF-8s to transformed to
> UTF-8 for I/O again we have no issues here.
>

Sorry, it should read:

When do you use UTF-8s? If the goal is to have the UTF-8s be used
internally in a product like Oracle so that it sorts the same as UTF-16,
but the data will actually be retrieved in UTF-16 sequence, there is no
issue because the encoding will only be used internally. If the UTF-8s is
transformed to UTF-8 for I/O, again we have no issues here.

> If the user actually wants UTF-8s input and output streams then the
> question is why? Why should it look anything like UTF-8? It is not
> interchangeable with UTF-8. You cannot send it to a browser or even use
> the UTF-8 string handling routines to manipulate the data. If you want
> to use any OS UTF-8 functions with UTF-8s it will not work.
>
> If you intend to cheat and say that you will limit your characters to
> plane 0 characters, this is not only a GROSS VIOLATION of the standard
> but ironically it makes the argument for UTF-8s go away, because without
> non-plane-0 characters UTF-8 and UTF-16 sort the same.
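
The sorting claim can be checked with a small sketch. The sample strings are arbitrary, and big-endian UTF-16 bytes are used as a stand-in for comparing 16-bit code units.

```python
def sort_by_utf8(strings):
    """Binary sort on the UTF-8 bytes."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def sort_by_utf16(strings):
    """Binary sort on UTF-16 code units (big-endian bytes compare the same way)."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


bmp_only = ["\uFF5E", "\u00E9", "A", "\u4E2D"]
with_supplementary = bmp_only + ["\U00010000"]

same_for_bmp = sort_by_utf8(bmp_only) == sort_by_utf16(bmp_only)
same_with_supplementary = (
    sort_by_utf8(with_supplementary) == sort_by_utf16(with_supplementary)
)

print(same_for_bmp, same_with_supplementary)
```

With only plane 0 data the two orders agree. Adding U+10000 makes them diverge, because its surrogates (D800 DC00) sort below U+FF5E in UTF-16 while its UTF-8 lead byte F0 sorts above EF.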
>
> The big question is what good is UTF-8s as proposed? What can you do with
> it? Why would you want it?
>
> What I picture is that the problem is a situation like this. You have a
> Sun Solaris server with an Oracle database. You know that the wchar_t
> implementation is not Unicode so you would like to use the UTF-8 services.
> You figure that UTF-8s will look enough like UTF-8 that it will fool the
> UTF-8 to UTF-16 converter. You ship the data as UTF-16 to your Windows
> client and everything works. The problem is that you need the same sort
> sequences in your client code as in your database.
>
> What is missing is that the UTF-8 services will break with UTF-8s data.
> In actuality you will be just as messed up using the Sun wide character
> support with your data in UTF-16 encoding as you will be using UTF-8s.
> In actuality you have little choice in the matter. At this stage of the
> game the only real solution is a cross-platform Unicode support package
> like ICU. This is why I am dedicating man-months of pro bono work to make
> ICU easier to implement for both new and existing applications. We really
> don't need a new encoding, we need good software to implement what we
> have. After more than 15 years of fighting code pages, I see Unicode as
> the only way to go. I will do what I can to see Unicode truly succeed.

An easier solution is to maintain the database in Unicode code point order.
It is not too difficult to insist that the database provide this facility.
You also need a special compare routine for UTF-16 that compares in code
point order. Then everything works.
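
Such a compare routine might look something like the sketch below. It uses the common fix-up trick of rotating the surrogate range above U+E000..U+FFFF before comparing the first differing code units. The names are hypothetical, and the input is assumed to be well-formed UTF-16.

```python
from functools import cmp_to_key


def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of a string as integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def _fixup(unit: int) -> int:
    """Reorder code units so surrogates sort after U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800           # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000          # D800..DFFF -> F800..FFFF
    return unit


def compare_utf16_code_point_order(a: list, b: list) -> int:
    """Compare two UTF-16 code unit sequences in code point order."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if _fixup(x) < _fixup(y) else 1
    return (len(a) > len(b)) - (len(a) < len(b))


words = ["\uFF5E", "\U00010000", "A", "\uE000", "\U0010FFFF"]
key = cmp_to_key(
    lambda s, t: compare_utf16_code_point_order(utf16_units(s), utf16_units(t))
)
by_fixup = sorted(words, key=key)

# Python compares str values by code point, so this serves as the reference order.
print(by_fixup == sorted(words))
```

Only the first differing pair of code units needs the fix-up, so the routine costs little more than a plain binary compare.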

>
> Carl



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT