Re: How to remove accents while conforming to language standards?

From: Jennifer Wong <jennifer.wong_at_workday.com>
Date: Mon, 4 Nov 2013 19:00:17 +0000

Thank you everyone for your input.

The use case is that customers want to integrate data from our enterprise solution into their ASCII-based downstream systems, so all accents need to be removed.

Ilay's "Transliteration on Passport" doc is very useful. We can use that as a basis to map special transliteration cases before normalizing and removing accents.
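
To make that two-step idea concrete, here is a minimal Python sketch, not a definitive implementation. The SPECIAL_CASES table and the to_ascii function are hypothetical names made up for illustration. The Danish and German entries come from the examples in this thread; the uppercase variants are my own assumption, and a real table would be filled in from the passport transliteration document.

```python
import unicodedata

# Hypothetical per-language overrides, applied BEFORE generic accent removal.
# Lowercase entries follow the examples in this thread; the uppercase
# variants are an assumption added for illustration.
SPECIAL_CASES = {
    "da": {"å": "aa", "Å": "Aa"},
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
}

def to_ascii(text, lang=None):
    """Apply language-specific mappings, then strip remaining accents."""
    # Compose first so decomposed input (e.g. "u" + U+0308) matches the table.
    text = unicodedata.normalize("NFC", text)
    for src, dst in SPECIAL_CASES.get(lang, {}).items():
        text = text.replace(src, dst)
    # Decompose, then drop combining marks (general category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(to_ascii("Århus", "da"))
print(to_ascii("Müller", "de"))
print(to_ascii("café"))
```

Note that letters without a canonical decomposition, such as "ø" or "ß", pass through this sketch unchanged, so they would also need explicit table entries before the output is guaranteed to be ASCII.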

Jennifer

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Monday, November 4, 2013 11:54 AM
To: Jennifer Wong <jennifer.wong_at_workday.com>
Cc: "unicode_at_unicode.org" <unicode_at_unicode.org>
Subject: Re: How to remove accents while conforming to language standards?

Hi Jennifer,

On Fri, Nov 1, 2013 at 8:37 AM, Jennifer Wong <jennifer.wong_at_workday.com> wrote:
I would like to ask for advice on removing accents from characters. While the normalization process is straightforward (NFD, remove accents), it does not take special cases into account. For example, in Danish, "å" should be mapped to "aa", not "a". Likewise, in German, "ä", "ö" and "ü" should be mapped to "ae", "oe" and "ue" respectively, not "a", "o" and "u". Are there common practices for handling these special cases? Thank you.

Can you describe what your use case is?

One area that appears not to have been discussed yet is string sorting and full-text search (as in ctrl-F in a browser or word processor). If you are after those, then please look for "unicode collation" and "cldr collation". The ICU libraries <http://userguide.icu-project.org/collation> might also help.

Best regards,
markus
Received on Mon Nov 04 2013 - 13:02:22 CST
