Re: Compatibility decomposition for Hebrew and Greek final letters from Markus Scherer on 2015-02-20 (Unicode Mail List Archive)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Fri, 20 Feb 2015 09:49:20 -0800

On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii <eliz_at_gnu.org> wrote:

> I think decomposition to NFKD solves these issues, doesn't it?
>

Not completely. Judging from your question, you expected more mappings than
NFKD has. You might want to try the mappings that are used as input for
deriving the DUCET (Default Unicode Collation Element Table):
http://www.unicode.org/Public/UCA/latest/decomps.txt
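
To illustrate the gap, here is a minimal Python sketch using only the
standard unicodedata module. The sample characters and the helper name
nfkd_report are my own picks for illustration, not something from the
UCA data: the final forms have no compatibility decomposition, so NFKD
leaves them alone, while a compatibility character such as the "fi"
ligature does get folded.

```python
import unicodedata

# Sample characters chosen for illustration (an assumption, not UCA data).
SAMPLES = {
    "HEBREW LETTER FINAL KAF": "\u05DA",
    "GREEK SMALL LETTER FINAL SIGMA": "\u03C2",
    "LATIN SMALL LIGATURE FI": "\uFB01",
}


def nfkd_report(samples):
    """Map each name to (original, NFKD form, whether NFKD changed it)."""
    report = {}
    for name, ch in samples.items():
        folded = unicodedata.normalize("NFKD", ch)
        report[name] = (ch, folded, folded != ch)
    return report


if __name__ == "__main__":
    for name, (ch, folded, changed) in nfkd_report(SAMPLES).items():
        before = " ".join(f"U+{ord(c):04X}" for c in ch)
        after = " ".join(f"U+{ord(c):04X}" for c in folded)
        print(f"{name}: {before} -> {after} (changed: {changed})")
```

So a search that should treat final and non-final forms alike needs
mappings beyond NFKD, whether taken from decomps.txt or maintained by
hand.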

For a character-based search, you should still try to work with canonical
equivalence, for example by applying the FCD check and normalizing when
that fails. http://www.unicode.org/notes/tn5/
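
As a rough sketch of that idea in Python (the function names is_fcd and
ensure_fcd are hypothetical, and this is a simplified reading of the
check described in the technical note, not ICU's implementation): walk
the string, compare the trailing combining class of each character's
canonical decomposition with the leading combining class of the next
one, and fall back to NFD only when the order is violated.

```python
import unicodedata


def is_fcd(text):
    """Return True if text passes a simple FCD check.

    For every character, take its full canonical decomposition (NFD of
    that single character). The check fails when the leading combining
    class is non-zero and lower than the trailing combining class of
    the previous character's decomposition.
    """
    prev_trail_cc = 0
    for ch in text:
        decomp = unicodedata.normalize("NFD", ch)
        lead_cc = unicodedata.combining(decomp[0])
        if lead_cc != 0 and prev_trail_cc > lead_cc:
            return False
        prev_trail_cc = unicodedata.combining(decomp[-1])
    return True


def ensure_fcd(text):
    """Return text unchanged if it is FCD, else its NFD form (which is FCD)."""
    return text if is_fcd(text) else unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    ordered = "a\u0323\u0301"    # dot below (ccc 220), then acute (ccc 230)
    unordered = "a\u0301\u0323"  # acute first: not in canonical order
    print(is_fcd(ordered), is_fcd(unordered))
    print(ensure_fcd(unordered) == ordered)
```

The point of the quick check is that most real text passes it, so the
cost of normalizing is paid only for the rare strings that need it.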

> Thanks. I've studied that already, and I do know that collation data
> can be used for search. But it's still a lot of data that I'd like to
> avoid loading, if possible.
>

Sure, as I said, it depends on what you need and want.

FYI, the ICU data file corresponding to the DUCET is about 160kB (for UCA
7.0) and could be reduced if limited to one specific use case, but the
collation and string-search code is large and complex.

Best regards,
markus

Received on Fri Feb 20 2015 - 11:50:50 CST