Re: How to remove accents while conforming to language standards?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Fri, 1 Nov 2013 20:52:58 +0000

On Fri, 1 Nov 2013 15:37:22 +0000
Jennifer Wong <jennifer.wong_at_workday.com> wrote:

> I would like to ask for advice on removing accents from characters.

Don't do it.

> While the normalization process is straightforward (NFD, remove
> accents), it does not take special cases into account. For
> example, in Danish, "å" should be mapped to "aa", not "a". Likewise, in
> German, "ä" "ö" "ü" should be mapped to "ae", "oe" and "ue"
> respectively, not "a", "o", "u". Are there common practices on how to
> handle these special cases? Thank you.

There are numerous ASCIIfication conventions, generally of limited
extent. For example, while the Romanian telegraphic convention would
turn a squiggle below into a 'z', one ASCIIfication of Sanskrit 'ś'
would use 's' followed by apostrophe and the academically dominant
method, the Harvard-Kyoto convention, would use 'z'.
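
If one must do it anyway, the usual shape is a per-language override
table applied before the generic "NFD, remove accents" step. What
follows is only a minimal sketch in Python under stated assumptions:
the names OVERRIDES and asciify are hypothetical, the table holds just
the Danish and German cases from the question (the upper-case entries
are my own assumption), and a real convention would need many more
entries and decisions.

```python
import unicodedata

# Hypothetical per-language override table (an assumption for this sketch).
# Only the cases named in the question are listed; upper-case forms are guesses.
OVERRIDES = {
    "da": {"\u00e5": "aa", "\u00c5": "Aa"},
    "de": {
        "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
        "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    },
}


def asciify(text, lang):
    """Apply language-specific overrides, then strip remaining nonspacing marks."""
    table = OVERRIDES.get(lang, {})
    # Compose first so that "a" + U+030A is seen as the precomposed key.
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(table.get(ch, ch) for ch in composed)
    # Generic fallback: decompose and drop nonspacing marks (category Mn).
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(asciify("\u00c5rhus", "da"))   # Danish override applies
print(asciify("M\u00fcller", "de"))  # German override applies
print(asciify("caf\u00e9", "fr"))    # no table for "fr": generic stripping only
```

Note that the caller has to supply the language; the text alone does
not say whether "ü" is German or something else, which is one more
reason the generic approach is fragile.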

It may be worth mentioning that combining marks can be of equal rank
with the base characters. Stripping the vowel marks from text in an
Indic script is as acceptable as stripping the vowels from English.
Also, I find it hard to believe that anyone but a Tamil would consider
a consonant-vowel combination in an Indic script a single character.
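
To make the point about Indic scripts concrete, here is a small Python
sketch (strip_marks is a hypothetical helper, and the Devanagari word
is merely my choice of example) that removes every character whose
general category begins with "M" after NFD. The dependent vowel signs
and the virama all disappear, leaving bare consonants.

```python
import unicodedata


def strip_marks(text):
    """Drop every mark (general categories Mn, Mc, Me) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II (the word "Hindi").
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
stripped = strip_marks(hindi)

# List what survives: only the three consonant letters remain.
for ch in stripped:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Three of the six code points are lost, and with them the vowels and
the conjunct, which is the equivalent of reducing "Hindi" to "Hnd".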

I was intensely annoyed to find LibreOffice treating a <consonant,
virama, consonant> combination as a single character for editing
purposes; I had to resort to a regular expression search and replace
operation to insert a space after the virama. It could be worse - on
Ubuntu 12.04 gnome-terminal and xterm, typing <THAI CHARACTER BO
BAIMAI, THAI CHARACTER SARA I, rubout> results in no net character
input!

Richard.
Received on Fri Nov 01 2013 - 15:55:54 CDT
