Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: DougEwell2@cs.com
Date: Tue May 29 2001 - 00:52:44 EDT


In a message dated 2001-05-28 13:56:50 Pacific Daylight Time,
cbrown@xnetinc.com writes:

> The problem with databases is that you have to have a locale independent
> sorting sequence. If you store a record with a key built with one locale,
> you might not be able to retrieve it using another locale sort sequence.

OK, now I think I understand this specific need for a straight binary
sorting order. As long as this stays internal and doesn't filter down to
users, where they will see Z before A-acute and all the Latin-1 characters
before A-macron, there is no problem... so far.
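
To make the user-visible effect concrete, here is a minimal Python sketch of
what a straight binary (code point) sort does. The sample words are
hypothetical and chosen only for illustration; they are not from the thread.

```python
# Hypothetical sample data, chosen only to illustrate the ordering.
words = [
    "Zebra",        # starts with U+005A LATIN CAPITAL LETTER Z
    "\u00c1gua",    # starts with U+00C1 A-acute (Latin-1 range)
    "\u0100bele",   # starts with U+0100 A-macron (just past Latin-1)
    "\u00e9cole",   # starts with U+00E9 e-acute (Latin-1 range)
    "apple",        # starts with U+0061 LATIN SMALL LETTER A
]

# Python compares str values code point by code point, so sorted()
# with no key gives a locale-independent binary order.
binary_sorted = sorted(words)

for w in binary_sorted:
    print(f"U+{ord(w[0]):04X}  {w}")
```

Z sorts ahead of A-acute, and every Latin-1 letter sorts ahead of A-macron.
That is stable across locales, which is what a database key needs, but it is
not an order a user would expect to see.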

> The problem is that [UTF-8] is wasteful of space. For CLOBs where the
> field are very large allocating 4 bytes per character, wastes space so
> they used UCS-2. Converting from UCS-2 to UTF-16 creates a sorting
> problem. UTF-16 keys and UTF-8 keys have different sorting sequences.

Converting from UCS-2 to UTF-16 should not create a sorting problem, because
the only difference between the two is that UCS-2 is ignorant of surrogates
while UTF-16 is aware of them. This is where the straight binary sorting
order, as valid as it may be for locale independence, needs to be modified;
it needs to take surrogates into account.
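
As a rough sketch of what "taking surrogates into account" could look like,
the Python below compares two strings three ways: by raw UTF-16 code units,
by raw UTF-8 bytes, and by UTF-16 code units after a fix-up that moves the
surrogate range above U+E000..U+FFFF. The function names (utf16_units,
fixup, codepoint_order_key) and the two sample characters are my own
assumptions for illustration, not anything proposed in the thread.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def fixup(unit):
    """Remap a code unit so surrogates sort after all other BMP values."""
    if unit >= 0xE000:
        return unit - 0x0800   # E000..FFFF moves down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # D800..DFFF moves up to F800..FFFF
    return unit

def codepoint_order_key(s):
    """Sort key over UTF-16 code units that yields code point order."""
    return [fixup(u) for u in utf16_units(s)]

bmp_char = "\uff5e"        # U+FF5E, a single UTF-16 code unit
supp_char = "\U00010000"   # U+10000, the surrogate pair D800 DC00

pair = [bmp_char, supp_char]
utf16_binary = sorted(pair, key=utf16_units)                 # raw code units
utf8_binary = sorted(pair, key=lambda s: s.encode("utf-8"))  # raw bytes
surrogate_aware = sorted(pair, key=codepoint_order_key)      # with fix-up
```

With raw code units, U+10000 sorts before U+FF5E because 0xD800 is less than
0xFF5E; with UTF-8 bytes the order is reversed. After the fix-up, the UTF-16
comparison agrees with the UTF-8 one, so keys built from either encoding
form sort the same way.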

> UTF-8s would have put the entire surrogate support into the hands of the
> application.

Which I don't necessarily think is such a hot idea. The mechanics of
different encoding forms, surrogates, combining characters, and other Unicode
details should be handled as early in the chain as possible, so applications
can just deal with "characters."

> Converting UCS-2 to UTF-16 support is a lot of work because most to
> operation are actually using UTF-32. This will match UTF-8 sorting.

As Michka observed, this may be "a lot of work" but it has to be done. It
could have been anticipated many years ago, and it is the right way to solve
the problem. Asking the standardizers to introduce a new hack to compensate
for industry's overreliance on the mechanical details of a previous hack is
the wrong way to solve the problem.

-Doug Ewell
Fullerton, California


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT