RE: Unicode->ASCII approximate conversion

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Dec 19 2003 - 07:01:59 EST

    Hallvard B Furuseth wrote:
    > I need a function which converts Latin Unicode characters to
    > the closest equivalent ASCII characters, e.g. "é" -> "e".
    >
    > Before I reinvent the wheel, does any public domain or GPL
    > code for this already exist?

    I don't know, sorry.

    > If not,
    > for the most part I expect I can make the mapping from the character
    > names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
    > in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

    Why the name!?

    The decomposition property (the 6th field on each line, i.e. field 5 when
    counting from 0) is much better for this.
    E.g.:

            00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9

    The decomposition field tells you that "é" (code 00E9 hex) is composed of
    ASCII "e" (code 0065 hex) and the combining acute accent (code 0301 hex):
    you keep the ASCII character and drop the combining accent.
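
    A minimal Python sketch of this idea, not a definitive implementation.
    It first splits the sample record to show where the decomposition sits,
    then uses the standard unicodedata module (which exposes the same field)
    instead of parsing the whole file. The names ascii_base and strip_accents
    and the "?" fallback are assumptions made for illustration.

```python
import unicodedata

# The sample UnicodeData.txt record quoted above, split on ";".
# Counting the code point as field 0, the decomposition is field 5.
SAMPLE = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
          "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")
fields = SAMPLE.split(";")
decomposition = fields[5]  # "0065 0301"


def ascii_base(ch, fallback="?"):
    """Hypothetical helper: reduce one character to ASCII via its decomposition."""
    if ord(ch) < 128:
        return ch                      # already ASCII
    if unicodedata.combining(ch):
        return ""                      # a combining mark: drop it
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0].startswith("<"):
        parts = parts[1:]              # skip a compatibility tag such as <compat>
    if not parts:
        return fallback                # no decomposition at all, e.g. "ø"
    # Recurse, because a decomposition may contain decomposable characters.
    return "".join(ascii_base(chr(int(p, 16)), fallback) for p in parts)


def strip_accents(text, fallback="?"):
    """Hypothetical helper: apply ascii_base to every character of text."""
    return "".join(ascii_base(ch, fallback) for ch in text)
```

    With this sketch, strip_accents("é") gives "e", while a letter with no
    decomposition, such as "ø", falls back to "?".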

    > Punctuation and other non-letters will be worse, but they are less
    > important to me anyway.

    The result is much better if you allow the ASCII conversion to be a string.
    This lets you map, e.g., "©" to "(c)", "½" to "1/2", and so on. This is also
    good for letters: "ß" to "ss", "å" to "aa", etc.
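
    One way to sketch the string-valued mapping in Python, under stated
    assumptions: a small hand-written table (here only the four examples
    above) takes priority, and everything else falls back to dropping the
    non-ASCII parts of the decomposition. The names OVERRIDES and to_ascii
    and the "?" fallback are hypothetical.

```python
import unicodedata

# Hand-picked replacements; only the examples mentioned above are listed.
OVERRIDES = {
    "\u00a9": "(c)",   # ©
    "\u00bd": "1/2",   # ½
    "\u00df": "ss",    # ß
    "\u00e5": "aa",    # å
}


def to_ascii(text, overrides=OVERRIDES, fallback="?"):
    """Hypothetical helper: approximate text as an ASCII string."""
    # Compose first, so "a" + combining ring is looked up as the single "å".
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                 # already ASCII
        elif ch in overrides:
            out.append(overrides[ch])      # string-valued replacement
        elif unicodedata.combining(ch):
            continue                       # stray combining mark: drop it
        else:
            # Fully decompose, then keep only the ASCII pieces.
            decomposed = unicodedata.normalize("NFKD", ch)
            kept = "".join(c for c in decomposed if ord(c) < 128)
            out.append(kept or fallback)
    return "".join(out)
```

    For example, to_ascii("© Straße") gives "(c) Strasse". The table is
    consulted before the decomposition, so "å" becomes "aa" rather than "a".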

    _ Marco


