Re: "textels" from Eric Muller on 2016-09-16 (Unicode Mail List Archive)

From: Eric Muller <eric.muller_at_efele.net>
Date: Fri, 16 Sep 2016 08:47:27 -0700

On 9/16/2016 8:30 AM, Janusz S. Bień wrote:
> Quote - Eric Muller <eric.muller_at_efele.net> (Fri, 16 Sep 2016,
> 17:03:54):
>
>> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
>>> (when working on a corpus of historical Polish we
>>> noticed some cases where standard Unicode equivalence was not
>>> convenient).
>>
>> I'm very interested to know more about those cases.
>
> For our search engine we were unable to use compatibility equivalence
> "out of the box" for splitting the ligature because it also converted
> long s to short s while we wanted to preserve the distinction.

I am interested in the problems with *canonical* equivalence. I thought
that you were talking about those before.

Compatibility equivalence is a completely different beast. It is, IMHO,
too coarse a tool and best forgotten. For any particular task, it's
typically doing too much (e.g. long/short s folding in your case) and
too little (not everything you need). There was an attempt at improving
the situation by providing a whole bunch of fine-grained, targeted
transformations (http://www.unicode.org/reports/tr30/), but that did not
pan out.
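
To make the "too much" point concrete, here is a minimal sketch using
Python's standard unicodedata module. The LIGATURES table and the
split_ligatures helper are hypothetical names made up for illustration;
they are not taken from the search engine discussed above, and the table
covers only the Latin ligatures in the Alphabetic Presentation Forms block.

```python
import unicodedata

# Hypothetical, hand-picked folding table: split ligatures only.
# Long s (U+017F) is deliberately absent, so it is left untouched.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "\u017ft",  # LONG S T ligature -> long s + t
    "\ufb06": "st",
}


def split_ligatures(text):
    """Replace each listed ligature with its letters; keep everything else."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


sample = "\ufb05 \u017f \ufb01"  # long-s-t ligature, long s, fi ligature

# Compatibility normalization splits the ligatures but also folds long s.
nfkc = unicodedata.normalize("NFKC", sample)

# The targeted table splits the ligatures and preserves long s.
targeted = split_ligatures(sample)

print(nfkc == "st s fi")
print(targeted == "\u017ft \u017f fi")
```

Canonical normalization (NFC/NFD) leaves all of these characters alone,
which is why the long/short s issue belongs to compatibility equivalence
only.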

Eric.

Received on Fri Sep 16 2016 - 10:48:25 CDT