Re: Re: Re: Need help on Unicode Databases.

From: Jake Morrison (Jacob.Morrison@cdc.com)
Date: Mon Jun 24 1996 - 22:44:08 EDT


> On 96.06.25 07:07, unicode@Unicode.ORG wrote:

>Jake:
>
>> Sorry, "static text" is not right. What I meant was that if you need to perform more
>> complicated comparisons on the text (substring matches, as in a LIKE SQL statement),
>> the overhead involved in UTF-8 can be significant (20 or 30% slower than with UCS-2).
>>
>> If it is a simple "yes or no" match on the text, it is not so bad. And of course,
>> UTF-8 sure beats ISO 2022.
>
>Hmm... is other timing information available? In reality (in our
>case) UTF-8 is not competing against UCS-2, but against EUC and
>ISO8859. Based on what you are telling me, I assume it's probably
>20-30% slower than ISO8859 also, but how about EUC?
>
>It seems to me that UTF-8 should not be very different from
>EUC in performance (since both are multibyte encoding). Or
>am I missing something?
>
>Thanks,
>Steve

Steve,

Sorry, I don't have any good performance data for you. I was just passing on what I heard
at the show. I'm sure that Software AG will have plenty of data showing how much faster
they are :-).

I personally have been working more with Asian character sets in X.500. For us,
searching is the easy part. The search time is overshadowed by the time required to ship
the data across the network and convert it to the client's local character set (Big-5,
etc.).

If you are comparing UTF-8 and ISO 8859-1 or EUC with European text (only code sets 0
and 1), I would think UTF-8 would be slower. In this case it would be a multi-byte
encoding vs a single-byte encoding. UCS-2 vs ISO 8859-1 would be a fairer comparison,
since both are fixed-width.

If it is UTF-8 vs Asian EUC, there are more variables. UTF-8 and EUC probably have about
the same search time, depending on your specific text. EUC may be available from the OS,
where UTF-8 may not be (this is changing fast). If much searching is to be done, you
would normally use the fixed-width form of EUC (wide characters). This compares better
with UCS-2 or UCS-4.
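
One reason fixed-width forms are attractive for heavy searching can be sketched as
follows. This is a minimal Python illustration under stated assumptions: the function
names `utf8_char_offset` and `ucs2_char_offset` are hypothetical helpers invented for the
example, and the sample text is not from this thread.

```python
def utf8_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-8 data, found by scanning."""
    count = 0
    for i, b in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; anything else starts a character.
        if (b & 0xC0) != 0x80:
            if count == n:
                return i
            count += 1
    raise IndexError(n)


def ucs2_char_offset(n: int) -> int:
    """Byte offset of the n-th character in UCS-2 data: simple arithmetic, no scan."""
    return 2 * n


text = "\u4e2d\u6587abc"         # two CJK characters followed by ASCII
utf8 = text.encode("utf-8")      # each CJK character takes three bytes here
ucs2 = text.encode("utf-16-be")  # every character takes two bytes

start = ucs2_char_offset(2)
print(utf8_char_offset(utf8, 2), start, ucs2[start:start + 2].decode("utf-16-be"))
```

With a variable-width encoding, locating a character position needs a scan from the
start; with a fixed-width one it is a multiplication.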
 
There are different variations on EUC for each language. With UTF-8 (or UCS-2) you can
support all American, European and Asian text with the same search code. If you are
using Big-5 on the client (Traditional Chinese), using EUC means converting to CNS 11643
and 4-byte codes on the server. No fun :-) and forget about performance.

For us, going to Unicode on the server really simplified things. We just had to worry
about having good mapping tables on the client.
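
The arrangement described above (Unicode on the server, charset mapping at the client
boundary) might look roughly like the following Python sketch. It is not the actual X.500
implementation: the function names, the sample directory entries, and the use of Python's
built-in `big5` and `iso-8859-1` codecs as the "mapping tables" are all assumptions made
for illustration.

```python
def client_to_server(raw: bytes, client_charset: str) -> str:
    """Map bytes in the client's local character set to Unicode for the server."""
    return raw.decode(client_charset)


def server_to_client(text: str, client_charset: str) -> bytes:
    """Map Unicode back to the client's character set; unmappable characters become '?'."""
    return text.encode(client_charset, errors="replace")


def search(entries, needle):
    """One substring search over Unicode text, whatever the script."""
    return [entry for entry in entries if needle in entry]


# Hypothetical server-side data, all held as Unicode.
directory = ["\u738b\u5c0f\u660e", "Jos\u00e9 Garc\u00eda", "\u7530\u4e2d\u592a\u90ce"]

# A Big-5 client and an ISO 8859-1 client use the same search code.
big5_query = "\u738b".encode("big5")
big5_hits = search(directory, client_to_server(big5_query, "big5"))
big5_reply = [server_to_client(hit, "big5") for hit in big5_hits]

latin1_query = "Garc\u00eda".encode("iso-8859-1")
latin1_hits = search(directory, client_to_server(latin1_query, "iso-8859-1"))
```

The per-language work is confined to the two conversion functions; the search itself never
needs to know which character set the client used.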

Regards,
Jake

--------
Jacob Morrison
Control Data Asia/Pacific Region E-mail: J.Morrison@twntpe.cdc.com
6/F, 131 Nanking East Road, Section 3 Voice: 886-2-715-2222 x217
Taipei, Taiwan R.O.C Fax: 886-2-712-9197


