Re: Character folding in text editors

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Sat, 20 Feb 2016 20:29:36 +0100

Yes, that can be used.

The easiest approach is to use ICU: create a collator using the "search"
keyword. That collator can be used to search for text, with whatever
strength you want (primary differences only, secondary, etc.). You can
also access the collation keys through the ICU API and build your own
mapping from characters to collation keys, for use with your own search
algorithm. That mapping can also be used to build equivalence classes of
characters, from each of which you can pick a representative.
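
For example, with ICU4J (a minimal, untested sketch; the locale, class
name, and sample strings are just illustrations):

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.RuleBasedCollator;
    import com.ibm.icu.text.SearchIterator;
    import com.ibm.icu.text.StringSearch;
    import com.ibm.icu.util.ULocale;
    import java.text.StringCharacterIterator;
    import java.util.*;

    public class FoldedSearch {
        public static void main(String[] args) {
            // Request the "search" collation type via the -u-co- extension.
            RuleBasedCollator coll = (RuleBasedCollator)
                Collator.getInstance(ULocale.forLanguageTag("en-u-co-search"));
            // PRIMARY strength ignores secondary (accent) and tertiary
            // (case) differences.
            coll.setStrength(Collator.PRIMARY);

            // 1. Search with the collator directly.
            String haystack = "o ö ø ó n ñ";
            StringSearch search = new StringSearch(
                "o", new StringCharacterIterator(haystack), coll);
            for (int pos = search.first(); pos != SearchIterator.DONE;
                    pos = search.next()) {
                System.out.println("match at " + pos);
            }

            // 2. Or group characters into equivalence classes by their
            //    collation keys and search with your own algorithm.
            Map<String, List<String>> classes = new TreeMap<>();
            for (String s : haystack.split(" ")) {
                String key = Arrays.toString(
                    coll.getCollationKey(s).toByteArray());
                classes.computeIfAbsent(key, k -> new ArrayList<>()).add(s);
            }
            // Roughly [o, ö, ø, ó] and [n, ñ] under the English tailoring.
            System.out.println(classes.values());
        }
    }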

If you don't use ICU, you can also use the CLDR data directly, but you'll
have to parse it yourself. You'd start with the root locale, then add in
the mappings from the children (e.g. de.xml). The parsing is not trivial,
but since you are only looking for equivalences (not ordering), it is
somewhat simpler.
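
For instance, extracting just the equivalence classes from one of those
files might look roughly like this (an untested sketch: it assumes a
local CLDR checkout, assumes de.xml carries a "search" tailoring, and
ignores most of the real rule syntax):

    import java.util.*;
    import java.util.regex.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class CldrEquivalences {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Don't try to fetch the LDML DTD the file references.
            dbf.setFeature("http://apache.org/xml/features/"
                + "nonvalidating/load-external-dtd", false);
            // Hypothetical path into a local CLDR checkout; you would
            // process root.xml first, then overlay children like de.xml.
            Document doc = dbf.newDocumentBuilder()
                .parse("cldr/common/collation/de.xml");
            // The tailoring rules live in a <cr> CDATA block.
            String rules = XPathFactory.newInstance().newXPath().evaluate(
                "/ldml/collations/collation[@type='search']/cr", doc);

            // Grossly simplified reading of the rule syntax: a reset "&"
            // or a primary relation "<" starts a new equivalence class;
            // the weaker relations "<<", "<<<", and "=" extend the current
            // one. Real rules also contain contractions, expansions,
            // prefixes, quoting, and comments, which this ignores.
            List<List<String>> classes = new ArrayList<>();
            Matcher m = Pattern.compile("(&|<<<|<<|<|=)\\s*([^<=&\\s]+)")
                .matcher(rules);
            List<String> current = null;
            while (m.find()) {
                if (m.group(1).equals("&") || m.group(1).equals("<")) {
                    current = new ArrayList<>();
                    classes.add(current);
                }
                current.add(m.group(2));
            }
            System.out.println(classes);
        }
    }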

Mark

On Sat, Feb 20, 2016 at 6:27 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> Unless we have case folding tailored by language, you cannot do that based
> on the Unicode database alone.
>
> However CLDR provides tailored data about collation.
>
> From my point of view, it is just a matter of selecting the collation
> strength to use for collation-based searches. All collations in CLDR are
> locale-dependent (the search algorithm must either use a language
> preselection, detect the default language of the document, respect a
> language set explicitly on specific fragments of the document, or use
> heuristics to guess the effective language), even though CLDR also
> defines a "root" locale for language-neutral contexts, or for when the
> language cannot be determined from the document or its metadata.
>
>
>
> 2016-02-20 11:23 GMT+01:00 Elias Mårtenson <lokedhs_at_gmail.com>:
>
>> Hello Unicode,
>>
>> I have been involved in a rather long discussion on the Emacs-devel
>> mailing list[1] concerning the right way to do character folding and we've
>> reached a point where input from Unicode experts would be welcome.
>>
>> The problem is the implementation of equivalence when searching for
>> characters. For example, if I have a buffer containing the following
>> characters (in both precomposed and canonically decomposed forms):
>>
>> o ö ø ó n ñ
>>
>> The character folding feature in Emacs allows a search for "o" to match
>> some or even all of these characters. The discussion on the mailing list
>> has revolved around both the fact that the correct behaviour here is
>> locale-dependent, and the correct way to implement this matching in the
>> absence of any locale-specific exceptions.
>>
>> An English speaker would probably expect a search for "o" to match the
>> first four characters and a search for "n" to match the last two.
>>
>> A Spanish speaker would expect n and ñ to be treated as different, but
>> would otherwise expect the same behaviour as the English user.
>>
>> A Swedish user would definitely expect o and ö to compare differently,
>> but ö and ø to compare the same.
>>
>> I have been reading the materials on unicode.org trying to see if this
>> has been specifically addressed anywhere by the Unicode Consortium, but my
>> results are inconclusive at best.
>>
>> What is the "correct" way to do this from Unicode's perspective? There is
>> clearly an aspect of locale-dependence here, but how far can the Unicode
>> data help?
>>
>> In particular, as far as I can see, there is no way the Unicode charts
>> allow me to write an algorithm in which o and ø are treated as similar
>> (as an English user would expect).
>>
>> [1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html
>>
>>
>
Received on Sat Feb 20 2016 - 13:31:06 CST
