RE: Bit arithmetic on Unicode characters? from Shawn Steele on 2016-10-06 (Unicode Mail List Archive)

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Fri, 7 Oct 2016 00:42:08 +0000

Presumably a table-based approach would merely require rerunning the script that builds the tables from the UCD whenever a new version is released.
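As a rough illustration of such a table-building script, here is a minimal sketch. It assumes the input is in the CaseFolding.txt format from the UCD ("code; status; mapping; # name") and that only simple case folding is wanted (statuses C and S). The SAMPLE excerpt and the function names are hypothetical; a real script would read the complete file for the Unicode version in question.

```python
# Sketch only: build a simple case-folding table from text in the
# CaseFolding.txt format ("<code>; <status>; <mapping>; # <name>").
# SAMPLE is a hand-copied excerpt for illustration; a real script would
# read the full file shipped with each UCD release.

SAMPLE = """\
# CaseFolding excerpt (illustrative)
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def build_simple_fold_table(text):
    """Return {code point: folded code point} for statuses C and S."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        # C = common, S = simple; F (full) and T (Turkic) are skipped.
        if status in ("C", "S"):
            table[int(code, 16)] = int(mapping, 16)
    return table


def simple_fold(ch, table):
    """Fold one character; characters not in the table map to themselves."""
    return chr(table.get(ord(ch), ord(ch)))


FOLD_TABLE = build_simple_fold_table(SAMPLE)
```

Rerunning build_simple_fold_table on the new CaseFolding.txt is then the whole update step; no hand-edited table is involved.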

-----Original Message-----
From: Unicode [mailto:unicode-bounces_at_unicode.org] On Behalf Of Richard Wordingham
Sent: Thursday, October 6, 2016 5:28 PM
To: unicode_at_unicode.org
Subject: Re: Bit arithmetic on Unicode characters?

On Thu, 6 Oct 2016 16:54:21 -0700
Ken Whistler <kenwhistler_at_att.net> wrote:

> On 10/6/2016 4:32 PM, Richard Wordingham wrote:
> > The problem is that manually constructed lookup tables are prone to
> > human error.
>
> ... as are manually constructed algorithms that attempt to take
> advantage of sub-ranges of case pair adjacency in the Unicode code
> charts to do casing with bit arithmetic.

Yes, it's a trade-off. The application I had in mind is converting between mathematical letter variants and their 'plain' forms. Perhaps there is just enough information in the UCD to allow exhaustive, automated tests.
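A sketch of what such an exhaustive test might look like, assuming Python's unicodedata module as the source of UCD data and assuming that the "<font>" compatibility decomposition is the right definition of the 'plain' form. It covers only the Latin part of the Mathematical Alphanumeric Symbols block (U+1D400..U+1D6A3, 13 alphabets of 52 letters each); the function names are hypothetical.

```python
import unicodedata

# Latin letters of the Mathematical Alphanumeric Symbols block:
# 13 styled alphabets, each laid out as A-Z followed by a-z.
FIRST, LAST = 0x1D400, 0x1D6A3


def plain_from_math_latin(cp):
    """Arithmetic mapping from a styled Latin letter to its ASCII form."""
    if not FIRST <= cp <= LAST:
        return None
    index = (cp - FIRST) % 52
    if index < 26:
        return chr(ord("A") + index)
    return chr(ord("a") + index - 26)


def ucd_plain(cp):
    """Plain form according to the UCD <font> decomposition, if any."""
    fields = unicodedata.decomposition(chr(cp)).split()
    if len(fields) == 2 and fields[0] == "<font>":
        return chr(int(fields[1], 16))
    return None  # unassigned gap, or no <font> decomposition


def mismatches():
    """Exhaustively compare the arithmetic against the UCD data."""
    bad = []
    for cp in range(FIRST, LAST + 1):
        expected = ucd_plain(cp)
        if expected is not None and expected != plain_from_math_latin(cp):
            bad.append(hex(cp))
    return bad
```

The reserved gaps in the block have no decomposition and are skipped, so the check covers exactly the assigned characters. The reverse direction (plain to styled) would need those gaps handled explicitly, which the sketch does not attempt.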

For _simple_ case folding, algorithmic case folding can be expanded to a list of range tests, generalising what is often done for ASCII.
Obviously the testing should be repeated with each new version of Unicode, which is straightforward if the case folding is compliant with Unicode. (Turkish would be a reason for not being compliant.)
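To make the range-test idea concrete, here is a small sketch covering a handful of ranges only. The RANGES list is illustrative, not complete, and the check uses Python's str.casefold() as a stand-in reference; that method implements full case folding, so the sketch compares only where the result is a single character. A stricter test would compare against the C and S entries of CaseFolding.txt.

```python
# Each entry: (first, last, delta, excluded code points).
# Illustrative subset only; many scripts and ranges are not listed.
RANGES = [
    (0x0041, 0x005A, 32, ()),         # ASCII A-Z
    (0x00C0, 0x00DE, 32, (0x00D7,)),  # Latin-1 capitals, not U+00D7
    (0x0391, 0x03A9, 32, (0x03A2,)),  # Greek capitals, not unassigned U+03A2
    (0x0400, 0x040F, 80, ()),         # Cyrillic U+0400..U+040F
    (0x0410, 0x042F, 32, ()),         # Cyrillic U+0410..U+042F
]


def fold_by_ranges(cp):
    """Simple case folding by range tests; unlisted code points are unchanged."""
    for first, last, delta, excluded in RANGES:
        if first <= cp <= last and cp not in excluded:
            return cp + delta
    return cp


def check_ranges():
    """Test every code point in the listed ranges against the reference."""
    bad = []
    for first, last, _delta, _excluded in RANGES:
        for cp in range(first, last + 1):
            reference = chr(cp).casefold()
            if len(reference) == 1 and ord(reference) != fold_by_ranges(cp):
                bad.append(hex(cp))
    return bad
```

Note that fold_by_ranges(0x49) gives 0x69 (I to i), which is the default, non-Turkish behaviour.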

Richard.
Received on Thu Oct 06 2016 - 19:42:36 CDT
