Re: That UTF-8 Rant

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Jul 23 1999 - 03:21:10 EDT


At 05:38 PM 7/22/99 -0700, Gary Roberts wrote:
>
>
>On Thu, 22 Jul 1999, Markus Kuhn wrote:
>
>> Actually, I happen to be extremely interested in exactly these
>> questions, because I happen to be someone who makes implementation
>> decisions about databases that could one day grow into the
>> hundreds-of-gigabyte range. I have not yet seen multi-terabyte plain
>> text databases though (perhaps the email/fax eavesdroppers at the NSA
>> have these, if anyone ;-), these tend more to be filled with images and
>> not text.
>
>We have many customers with multi-terabyte databases. Our Japanese
>customers in particular have claimed a high percentage of character data
>(The rest is almost entirely numeric). Our Unicode (UTF-16)
>implementation is criticized as being inefficient in storage relative to
>Shift-JIS (which we also support). I suspect a UTF-8 implementation would
>be unpopular.
>

I wonder why you don't support SCSU. You can actually get more compact
Japanese (relative to Shift-JIS, UTF-16 and UTF-8), with not much more
computation than for UTF-8 - as long as you can live with unpacking the
data into UTF-16 during processing, as many people do with UTF-8. (SCSU is
stateful and, unlike UTF-8, not self-synchronizing, and therefore doesn't
support random access into the middle of a field.)

SCSU = Standard Compression Scheme for Unicode. See Unicode Technical
Report #6 for details: http://www.unicode.org/unicode/reports/tr6
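
To make the storage comparison concrete, here is a minimal sketch that
measures the byte length of the same text in Shift-JIS, UTF-16 and UTF-8.
The sample strings and the helper name encoded_sizes are made up for
illustration and are not data from this thread. SCSU is left out because
the Python standard library has no codec for it.

```python
# Hypothetical sample strings, chosen only for illustration.
KANJI_KANA = "日本語のテキストは文字データの割合が高い。"
MIXED = "注文番号 A-1024 東京"  # kanji mixed with ASCII


def encoded_sizes(text):
    """Return the byte length of text in each encoding discussed above."""
    return {
        "shift_jis": len(text.encode("shift_jis")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-8": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    for label, sample in (("kanji/kana", KANJI_KANA), ("mixed", MIXED)):
        for name, size in encoded_sizes(sample).items():
            print(f"{label:10s} {name:10s} {size:3d} bytes")
```

For text that is purely kanji and kana, Shift-JIS and UTF-16 both take two
bytes per character and UTF-8 takes three. The gap between UTF-16 and
Shift-JIS shows up once ASCII is mixed in, since Shift-JIS stores ASCII in
one byte and UTF-16 in two.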

A./



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT