I am trying to build a Unicode-based transliteration table from Cyrillic to
7-bit ASCII and would like to request the assistance of the Unicode list.
The goal is to improve an existing program I wrote which automatically
detects the encoding form of Cyrillic text (8-bit character sets such as DOS
CP 866, Windows CP 1251, or KOI-8, as well as UTF-8) and optionally
transliterates the text to a 7-bit ASCII representation that an English
speaker can reasonably sound out.
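
To illustrate the detection step, here is a minimal Python sketch. It is not the logic of the program described above. It assumes a simple heuristic: accept the input as UTF-8 if it decodes cleanly, and otherwise pick whichever 8-bit candidate yields the most lowercase Russian letters. The names `CANDIDATES`, `score`, and `detect_encoding` are hypothetical, and the candidate list uses KOI8-R specifically.

```python
# Hypothetical sketch; not the detection logic of the program described above.
# Assumption: running text is mostly lowercase, so the correct code page
# produces the most characters in the range U+0430..U+044F.
CANDIDATES = ["cp866", "cp1251", "koi8_r"]


def score(text):
    """Count lowercase Russian letters (U+0430..U+044F) in decoded text."""
    return sum(1 for ch in text if "\u0430" <= ch <= "\u044f")


def detect_encoding(data):
    """Guess the encoding of a byte string containing Cyrillic text."""
    # Legacy 8-bit Cyrillic text is very rarely well-formed UTF-8,
    # so a clean strict decode is taken as sufficient evidence.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise choose the candidate whose decoding looks most like Russian.
    return max(CANDIDATES,
               key=lambda enc: score(data.decode(enc, errors="replace")))


sample = "привет мир"
for enc in ["utf-8"] + CANDIDATES:
    print(enc, "->", detect_encoding(sample.encode(enc)))
```

A heuristic this simple can be fooled by very short input or by text in all capitals; a letter-frequency or digraph model would likely be more robust.
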
Lots of Cyrillic-Latin transliterations are available for the Russian
alphabet. I am looking for one that targets only the 7-bit ASCII set, which
rules out ISO 9, and covers more than just Russian, which rules out many
others. One-to-one correspondence between letters is not a goal; it is
perfectly OK to transliterate U+0429 to "SHCH". Likewise, round-tripping is
not a goal; U+0428 + U+0427 and U+0429 would both be expected to map to "SHCH".
What I do want is something that generates a usable pronunciation without
using digits or letters like Q for no purpose other than uniqueness, and
which is based on Unicode values and addresses as much of the Unicode
Cyrillic block as possible, including the new characters planned for Unicode
3.2 if possible.
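
As a rough illustration of the kind of table meant here, the following Python sketch keys the mapping on Unicode code points and allows multi-letter ASCII output. Only the U+0429 to "SHCH" mapping comes from this message; every other entry, and the names `TABLE` and `transliterate`, are assumptions chosen for the example and not a proposed standard.

```python
# Hypothetical fragment of a code-point-keyed table. Only U+0429 -> "SHCH"
# is taken from the message; the other entries are illustrative assumptions.
TABLE = {
    0x0406: "I",     # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
    0x0408: "J",     # CYRILLIC CAPITAL LETTER JE
    0x0410: "A",     # CYRILLIC CAPITAL LETTER A
    0x0411: "B",     # CYRILLIC CAPITAL LETTER BE
    0x0416: "ZH",    # CYRILLIC CAPITAL LETTER ZHE
    0x0427: "CH",    # CYRILLIC CAPITAL LETTER CHE
    0x0428: "SH",    # CYRILLIC CAPITAL LETTER SHA
    0x0429: "SHCH",  # CYRILLIC CAPITAL LETTER SHCHA
}


def transliterate(text):
    """Map Cyrillic letters to ASCII; leave unmapped characters unchanged."""
    out = []
    for ch in text:
        mapped = TABLE.get(ord(ch))
        if mapped is None:
            # Fall back to the capital letter's entry, lowercased.
            upper = ch.upper()
            if len(upper) == 1 and ord(upper) in TABLE:
                mapped = TABLE[ord(upper)].lower()
            else:
                mapped = ch
        out.append(mapped)
    return "".join(out)


# The mapping is deliberately many-to-one, so it does not round-trip:
print(transliterate("\u0428\u0427"))  # SHA + CHE
print(transliterate("\u0429"))        # SHCHA
```

Both calls produce the same ASCII string, which is acceptable because round-tripping is not a goal.
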
I know that neither UTC nor WG2 engages in the very controversial business of
assigning canonical transliterations between scripts, and I am not asking
them to. I would like the private assistance of capable list members in
providing an unofficial solution.
Part of a possible transliteration table might look like the following:
If anyone would like to help, please e-mail me privately at DougEwell2@cs.com,
or you can write to the list if you feel your response would be of interest
to the list at large.