RE: Transliteration

Date: Mon Mar 06 2000 - 16:48:39 EST

You can view (or check out) the data files online:


For example, the data files for the Greek transliteration rules and the
basic locale data are at the following two locations:

(If your emailer wraps these lines, you might have to reconstruct the URLs.)

The source format for the locale data is not XML, having predated it.
However, it would be a trivial matter to convert it, and we are looking at
using XML for the source format in the future.

Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891],, on 2000.03.06 13:03:57

To: Mark Davis/Cupertino/IBM@IBMUS
Subject: RE: Transliteration

(this message has little to do with Nokia...)

I haven't yet looked closely at ICU but it certainly looks very interesting,
mostly because I've been contemplating doing something very similar myself.

My arena, however, is Perl (I am one of the core developers of Perl). What
I had planned to do was of course something slightly less ambitious, and
(I think; I must say "I think" because, as I said, I still haven't looked
at ICU, only read its docs and tried out the locale browser) more modular,
in that I would have had a separate Perl module that would have held only
the names of the weekdays and months (in UTF-8) for the various languages,
and that would have been completely separate from, say, a collation module.
(How would I have received the data? Well, manually and by
it would have been a long slow project.)

Now I'm interested: in what kind of format is the ICU data represented?
If the data files were XML in UTF-8, they could easily be used from any
programming language that can parse UTF-8 XML. Whether users of those
languages would prefer some compact native binary database is their
concern. But having the data files (plus rules like the transliteration
rules) easily available and separate from the APIs would be great.
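
To make the point about language-independent data files concrete, here is
a minimal sketch in Python. The XML layout (the locale, weekdays, day,
months and month elements) is purely hypothetical and is NOT the actual
ICU source format; the function name load_names is likewise made up for
illustration. It only shows that weekday and month names kept in a UTF-8
XML file can be read with nothing more than a stock XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical UTF-8 XML locale file; the element names are assumptions
# made for this sketch and do not reflect the real ICU data format.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<locale id="el">
  <weekdays>
    <day>Κυριακή</day>
    <day>Δευτέρα</day>
    <day>Τρίτη</day>
    <day>Τετάρτη</day>
    <day>Πέμπτη</day>
    <day>Παρασκευή</day>
    <day>Σάββατο</day>
  </weekdays>
  <months>
    <month>Ιανουάριος</month>
    <month>Φεβρουάριος</month>
    <month>Μάρτιος</month>
  </months>
</locale>
""".encode("utf-8")


def load_names(xml_bytes):
    """Return the locale id, weekday names and month names from the XML."""
    root = ET.fromstring(xml_bytes)  # bytes input honours the declared encoding
    return {
        "locale": root.get("id"),
        "weekdays": [d.text for d in root.findall("./weekdays/day")],
        "months": [m.text for m in root.findall("./months/month")],
    }


if __name__ == "__main__":
    names = load_names(SAMPLE)
    print(names["locale"], names["weekdays"])
```

The same file could be read from Perl, Java or C with their own XML
parsers, which is the separation between data and API argued for above.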

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT