Re: Algorithms for Unicode script detection

From: Khaled Hosny via Unicode <unicode_at_unicode.org>
Date: Thu, 6 Jul 2017 01:59:26 +0200

On Thu, Jul 06, 2017 at 09:43:29AM +1000, Simon Cozens via Unicode wrote:
> I want to segment a Unicode text into runs according to their script.
> I've had a look through UAX#24 in the hope of finding a standard
> algorithm for doing this, but there isn't one specified. The
> implementation section gives some good pointers for what to be careful
> with (paired punctuation, etc.) but I can't find a step-by-step
> algorithm similar to the bidi algorithm or collation algorithm.
>
> Equally, I don't see anything in ICU that segments into script-based
> runs. You can get script properties, but that doesn't help you resolve
> common characters in the context of a run.
>
> Does anyone know of an open-source algorithm for doing this?

There is source/extra/scrptrun/ in the ICU source tree (though it is not
part of the API); apparently it is used by ICU's ParagraphLayout library.
(A copy of this code is used by Pango, and another copy by LibreOffice.)
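
To illustrate the general idea of resolving common characters in the
context of a run, here is a minimal Python sketch. It is not the ICU
scrptrun code. SCRIPT_RANGES, script_of and script_runs are hypothetical
names, the table is a tiny approximation (real data comes from Scripts.txt),
anything not in the table is treated as Common, and paired punctuation
is not handled at all.

```python
# Hypothetical, heavily truncated script table: (first, last, script code).
# A real implementation would load the full Script property from Scripts.txt.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x0391, 0x03A1, "Grek"),  # Greek capitals, first part
    (0x03A3, 0x03A9, "Grek"),  # Greek capitals, second part
    (0x03B1, 0x03C9, "Grek"),  # Greek small letters
    (0x0410, 0x044F, "Cyrl"),  # basic Cyrillic letters
    (0x05D0, 0x05EA, "Hebr"),  # Hebrew letters
]

COMMON = "Zyyy"  # ISO 15924 code for Common


def script_of(ch):
    """Return a script code for ch; anything not in the table is Common."""
    cp = ord(ch)
    for first, last, script in SCRIPT_RANGES:
        if first <= cp <= last:
            return script
    return COMMON


def script_runs(text):
    """Split text into (script, substring) runs.

    Common characters join the run they appear in. Leading Common
    characters are claimed by the first real script that follows them.
    """
    runs = []
    start = 0
    current = COMMON
    for i, ch in enumerate(text):
        sc = script_of(ch)
        if sc == COMMON:
            continue  # stays in the current run
        if current == COMMON:
            current = sc  # first real script claims the leading characters
        elif sc != current:
            runs.append((current, text[start:i]))
            start = i
            current = sc
    if text:
        runs.append((current, text[start:]))
    return runs


if __name__ == "__main__":
    for script, chunk in script_runs("abc, αβγ!"):
        print(script, repr(chunk))
```

In this sketch the comma and space after "abc" stay with the Latin run
and the "!" stays with the Greek run, simply because Common characters
never start a new run. Handling brackets and quotes so that a closing
mark gets the same script as its opener needs extra bookkeeping that is
left out here.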

Regards,
Khaled
Received on Wed Jul 05 2017 - 18:59:53 CDT
