RE: Unicode->ASCII approximate conversion

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Fri Dec 19 2003 - 07:01:59 EST

Next message: jon@hackcraft.net: "Re: Unicode->ASCII approximate conversion"

Previous message: Philippe Verdy: "RE: Unicode->ASCII approximate conversion"
Maybe in reply to: Hallvard B Furuseth: "Unicode->ASCII approximate conversion"
Next in thread: jon@hackcraft.net: "Re: Unicode->ASCII approximate conversion"

Hallvard B Furuseth wrote:
> I need a function which converts Latin Unicode characters to
> the closest equivalent ASCII characters, e.g. "é" -> "e".
>
> Before I reinvent the wheel, does any public domain or GPL
> code for this already exist?

I don't know, sorry.

> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

Why the name!?

The decomposition property (field 5 on each line, counting the code point as
field 0) is much better for this.
E.g.:

00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9

The decomposition field tells you that "é" (code 00E9 hex) is composed of
ASCII "e" (code 0065 hex) and the combining acute accent (code 0301 hex):
you keep the ASCII character and drop the combining accent.
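
A minimal sketch of that idea in Python. Instead of parsing UnicodeData.txt by
hand, it assumes Python's standard unicodedata module, whose decomposition()
call returns the same field as a string. The function name ascii_base is only
illustrative, and restricting the lookup to letters is my assumption, so that
symbols such as "≠" are not silently turned into "=".

```python
import unicodedata

def ascii_base(ch):
    """Return the ASCII base letter of ch, or None if there is none."""
    if ord(ch) < 0x80:
        return ch  # already ASCII
    # Assumption: only letters are handled here; symbols are left alone.
    if not unicodedata.category(ch).startswith("L"):
        return None
    while ord(ch) >= 0x80:
        fields = unicodedata.decomposition(ch).split()
        # No decomposition, or a compatibility one tagged like <compat>:
        # give up rather than guess.
        if not fields or fields[0].startswith("<"):
            return None
        # Keep the first code point (the base) and drop the combining marks.
        # Loop, because the base may itself decompose further.
        ch = chr(int(fields[0], 16))
    return ch
```

Characters without a canonical decomposition, such as "ß", come back as None
and need separate handling.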

> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.

The result is much better if you allow the ASCII conversion to be a string.
This allows mappings such as "©" -> "(c)", "½" -> "1/2", and so on. This is
also good for letters: "ß" -> "ss", "å" -> "aa", etc.
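
A rough sketch of such a string-valued mapping in Python. The table holds only
the four examples above; the names OVERRIDES and to_ascii, the "?" fallback,
and the use of the standard unicodedata module to strip accents from
everything not in the table are all assumptions made for illustration.

```python
import unicodedata

# Hand-written replacements, consulted before any accent stripping.
OVERRIDES = {"©": "(c)", "½": "1/2", "ß": "ss", "å": "aa"}

def to_ascii(text, fallback="?"):
    """Approximate text in ASCII; each character may become a string."""
    out = []
    # Compose first, so "a" + combining ring is looked up as "å".
    for ch in unicodedata.normalize("NFC", text):
        if ord(ch) < 0x80:
            out.append(ch)
        elif ch in OVERRIDES:
            out.append(OVERRIDES[ch])
        else:
            # Decompose and drop the combining marks.
            base = "".join(c for c in unicodedata.normalize("NFD", ch)
                           if not unicodedata.combining(c))
            # Use the result only if something ASCII is left.
            out.append(base if base and base.isascii() else fallback)
    return "".join(out)
```

The table is checked first on purpose: otherwise "å" would decompose to a
plain "a" and the "aa" entry would never apply.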

_ Marco


This archive was generated by hypermail 2.1.5 : Fri Dec 19 2003 - 07:41:05 EST