Re: precomposed polytonic Greek characters with macrons and other diacritics

From: Elizabeth Mattijsen <liz_at_dijkmat.nl>
Date: Mon, 8 Feb 2016 20:29:35 +0100

> On 08 Feb 2016, at 20:10, Markus Scherer <markus.icu_at_gmail.com> wrote:
>
> On Mon, Feb 8, 2016 at 10:47 AM, James Tauber <jtauber_at_jtauber.com> wrote:
> Even with all this, though, my own work includes accentuation and syllabification algorithms, all of which are made more cumbersome by the lack of precomposed characters indicating vowel length. I'm currently leaning towards adding a layer of "character" processing on top of Python 3's otherwise decent support that effectively treats the relevant character sequences as single characters even if they aren't (and can't be) precomposed.
>
> I suggest you normalize the text (NFC or NFD), and then look for "grapheme clusters". http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
>
> In C++ and Java, you could use an ICU BreakIterator for the latter.

Might I suggest looking at Rakudo Perl 6’s implementation of NFG (Normalization Form Grapheme), which generates synthetic codepoints on the fly under the hood?

For an introduction, see http://jnthn.net/papers/2015-spw-nfg.pdf
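
For the Python 3 case James mentions, the normalize-then-cluster approach Markus describes could be sketched roughly as below, using only the standard library. This is a minimal sketch under assumptions: `clusters` is a hypothetical helper name, and attaching every combining mark (general category M*) to the preceding base character is only an approximation of the UAX #29 grapheme cluster rules, not a full implementation of them. It is probably adequate for polytonic Greek with macrons, but not for scripts such as Hangul or for emoji sequences.

```python
import unicodedata


def clusters(text, form="NFD"):
    """Normalize text, then group each base character with the
    combining marks that follow it (a rough stand-in for UAX #29
    grapheme clusters; hypothetical helper, not a library API)."""
    normalized = unicodedata.normalize(form, text)
    result = []
    for ch in normalized:
        # Categories Mn, Mc and Me are the combining marks.
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch   # attach the mark to the current cluster
        else:
            result.append(ch)  # start a new cluster
    return result


# alpha + combining macron + combining acute, then nu.
# No single precomposed character exists for the first three codepoints.
word = "\u03B1\u0304\u0301\u03BD"
print(len(word))            # 4 codepoints
print(len(clusters(word)))  # 2 clusters
```

Accentuation or syllabification code can then iterate over the returned list and treat each element as one "character", whether or not a precomposed form exists.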

Liz
Received on Mon Feb 08 2016 - 15:04:39 CST
