Re: Unicode->ASCII approximate conversion

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri Dec 19 2003 - 08:10:31 EST


    On Fri, 19 Dec 2003 jon@hackcraft.net wrote:

    > Quoting Hallvard B Furuseth <h.b.furuseth@usit.uio.no>:
    >
    > > I need a function which converts Latin Unicode characters to the closest
    > > equivalent ASCII characters, e.g. "é" -> "e".

    > 1. Produce the NFD normalisation of the text.
    > 2. Remove all characters with a non-zero combining class.
    > 3. Some non-ASCII characters may remain (particularly those from non-Latin
    > scripts). Some of these can be handled nicely, but others may require you
    > to raise an exception or output a replacement character.
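
The three steps quoted above can be sketched in Python with the standard
unicodedata module. This is only a minimal illustration, not code from the
thread: the function name to_ascii_approx and the "?" replacement character
are assumptions, and step 3 here always substitutes the replacement rather
than raising an exception.

```python
import unicodedata


def to_ascii_approx(text, replacement="?"):
    """Approximate text in ASCII (hypothetical helper, illustration only)."""
    # Step 1: produce the NFD (canonical decomposition) form, so that
    # "é" becomes "e" followed by U+0301 COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        # Step 2: drop every character with a non-zero combining class.
        if unicodedata.combining(ch) != 0:
            continue
        # Step 3: whatever is still outside ASCII gets the replacement
        # character (an assumption; raising an exception is the other option).
        out.append(ch if ord(ch) < 128 else replacement)
    return "".join(out)
```

Letters with no canonical decomposition, such as "ø", are not reduced by
steps 1 and 2, so they reach step 3 and come out as the replacement
character unless they are handled specially.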

    > on your application. Specialised handling of some characters is possible, for
    > instance you could convert the trademark sign to "(TM)" to avoid confusion,

      For Korean (Hangul) syllables (U+AC00 - U+D7A3), you can use the 'Hangul
    Syllable Short Names', which can be derived algorithmically from small
    tables.
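
As a rough sketch of that derivation: the arithmetic below follows the
Hangul syllable decomposition in the Unicode Standard, and the three tables
hold the Jamo short names for leading consonants, vowels and trailing
consonants. The function name hangul_short_name is an assumption made for
this example.

```python
S_BASE = 0xAC00               # first precomposed Hangul syllable
V_COUNT, T_COUNT = 21, 28     # number of vowels / trailing consonants
N_COUNT = V_COUNT * T_COUNT   # syllables per leading consonant (588)
S_COUNT = 19 * N_COUNT        # total syllables (11172), ending at U+D7A3

# Jamo short names; empty strings mark the silent leading consonant
# and the "no trailing consonant" case.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_short_name(ch):
    """Return the short name of a Hangul syllable, or None if ch is not one."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return None
    l_index = s_index // N_COUNT              # leading consonant
    v_index = (s_index % N_COUNT) // T_COUNT  # vowel
    t_index = s_index % T_COUNT               # trailing consonant (0 = none)
    return JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]
```

For example, hangul_short_name("\ud55c") gives "HAN". The names come out in
upper case, so an application may want to lower-case them and decide how to
separate adjacent syllables.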


