Re: Multi language in the same database

From: Linus Toshihiro Tanaka (ttanaka@us.oracle.com)
Date: Thu Mar 16 2000 - 22:32:56 EST


> I am figuring out on how to manage a multi-language row data in the
> same table of the same database. That is, assuming we are using
> Unicode UTF-8 for storage.

Various database vendors provide Unicode solutions. For example, you can
choose UTF8 as the database character set in Oracle8 and Oracle8i.

> Say we have a column of Gender which can contain the agenda of a
> customer in different language base (catering for regional needs) for
> eg.
>
> Now, if we need to do a sorting of the column, how do we do it?

Before sending the retrieved data to the client or middle-tier, you
probably want to sort the result inside the database first (otherwise,
you have to sort the entire result set in the client or middle-tier).
Once the result is sorted in the database, you can retrieve, for
example, only the first 100 rows of the already sorted data to the
client or middle-tier.
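
As a rough illustration of sorting inside the database and fetching only
the first few rows, here is a minimal Python sketch. It uses an in-memory
SQLite database purely as a stand-in (not Oracle), and the table name
`customer`, the column `name`, and the function `first_sorted_rows` are
hypothetical names chosen for the example.

```python
import sqlite3


def first_sorted_rows(names, limit):
    """Let the database sort, then fetch only the first `limit` rows."""
    conn = sqlite3.connect(":memory:")  # stand-in database, not Oracle
    conn.execute("CREATE TABLE customer (name TEXT)")
    conn.executemany(
        "INSERT INTO customer (name) VALUES (?)", [(n,) for n in names]
    )
    # ORDER BY runs inside the database; LIMIT caps what reaches the client.
    rows = conn.execute(
        "SELECT name FROM customer ORDER BY name LIMIT ?", (limit,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

With this approach the client or middle-tier never has to hold or sort
the full result set; it only receives the rows it will actually display.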

If you are using Oracle, the linguistic sort order is determined by
your client settings (you can also change it dynamically). If you are
using Oracle8i 8.1.5 or 8.1.6, you can create multiple linguistic
indexes (one per language) to improve the performance of linguistic
sorts.
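
To show why the sort order has to depend on the language, here is a toy
Python sketch, again using SQLite as a stand-in rather than Oracle. The
two collations `toy_german` and `toy_swedish` are simplified assumptions
invented for this example (they are not real linguistic definitions and
not Oracle's): the German one sorts "ä" together with "a", the Swedish
one sorts "å", "ä", "ö" after "z".

```python
import sqlite3

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(text):
    # Rank each letter by its position in the (toy) Swedish alphabet;
    # anything else sorts after the alphabet, by code point.
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET
        else len(SWEDISH_ALPHABET) + ord(ch)
        for ch in text.lower()
    ]


def german_key(text):
    # Compare on the unaccented form first, then break ties on the original.
    lowered = text.lower()
    return (lowered.translate(GERMAN_FOLD), lowered)


def make_collation(key):
    def compare(a, b):
        ka, kb = key(a), key(b)
        return (ka > kb) - (ka < kb)  # -1, 0 or 1, as SQLite expects
    return compare


COLLATIONS = {"toy_german": german_key, "toy_swedish": swedish_key}


def sorted_names(names, collation):
    """Sort `names` inside SQLite using one of the toy collations."""
    if collation not in COLLATIONS:
        raise ValueError(f"unknown collation: {collation}")
    conn = sqlite3.connect(":memory:")
    for name, key in COLLATIONS.items():
        conn.create_collation(name, make_collation(key))
    conn.execute("CREATE TABLE customer (name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?)", [(n,) for n in names])
    rows = conn.execute(
        f"SELECT name FROM customer ORDER BY name COLLATE {collation}"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

The same five names come back in a different order depending on which
collation the query asks for, which is the behaviour a per-language
linguistic sort provides.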

> Isn't it cleaner to have separate tables for UTF-8 format but can use
> the same table for UTF-16/UCS-2 for fixed width format?

For the database storage encoding, some vendors offer UTF-8, while
others offer UTF-16/UCS-2. Even when the storage encoding is UTF-8,
UCS-2 APIs may exist. Which one is better is not an easy question.
The answer depends on what programming languages you use, which
languages' characters you want to store, etc.

In terms of storage efficiency, UTF-8 is a very good choice as a
storage encoding if the data are mostly in European languages written
in Latin script. UTF-8 takes about the same space as UCS-2 or UTF-16
for Greek, Russian, Arabic, Hebrew and some other languages. UTF-8
takes more space than UCS-2 or UTF-16 only if more than 40% (?) of the
data are in Chinese, Japanese, Korean, Thai, Indic, Dravidian or some
other languages.

+----------------------------------------------------------------+
| Linus Toshihiro Tanaka 500 Oracle Parkway M/S 4op7 |
| NLS Consulting Team Redwood Shores, CA 94065 USA |
| Server Globalization Technology email: ttanaka@us.oracle.com |
| Oracle Corporation |
+----------------------------------------------------------------+


