Re: precomposed polytonic Greek characters with macrons and other diacritics

From: James Tauber <jtauber_at_jtauber.com>
Date: Mon, 8 Feb 2016 17:59:10 -0600

On Mon, Feb 8, 2016 at 1:29 PM, Elizabeth Mattijsen <liz_at_dijkmat.nl> wrote:

> > On 08 Feb 2016, at 20:10, Markus Scherer <markus.icu_at_gmail.com> wrote:
> >
> > On Mon, Feb 8, 2016 at 10:47 AM, James Tauber <jtauber_at_jtauber.com> wrote:
> > Even with all this, though, my own work includes accentuation and
> > syllabification algorithms, all of which are made more cumbersome by the
> > lack of precomposed characters indicating vowel length. I'm currently
> > leaning towards adding a layer of "character" processing on top of Python
> > 3's otherwise decent support that effectively treats the relevant
> > character sequences as single characters even if they aren't (and can't
> > be) precomposed.
> >
> > I suggest you normalize the text (NFC or NFD), and then look for
> > "grapheme clusters".
> > http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> >
> > In C++ and Java, you could use an ICU BreakIterator for the latter.
>
> Might I suggest looking at Rakudo Perl 6’s implementation of NFG
> (Normalization Form Grapheme) which will generate synthetic codepoints on
> the fly under the hood.
>
> For an introduction, see http://jnthn.net/papers/2015-spw-nfg.pdf
>

Thanks very much, I'll look into this.

Having done a Python implementation of the UCA, I'm quite looking forward
to doing more Unicode tools for Python.
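
As a rough illustration of the normalize-then-cluster idea suggested above,
here is a minimal Python 3 sketch. It is a simplification, not the full
UAX #29 grapheme cluster algorithm: it only attaches combining marks
(general categories Mn, Mc, Me) to the preceding base character, which is
assumed to be enough for polytonic Greek vowels carrying macrons, breves
and accents. The function name `clusters` is made up for the example.

```python
import unicodedata


def clusters(text, form="NFD"):
    """Split text into base-plus-combining-mark sequences.

    Simplified sketch: this is NOT the full UAX #29 algorithm. It only
    groups each combining mark with the character that precedes it.
    """
    normalized = unicodedata.normalize(form, text)
    result = []
    for ch in normalized:
        # General categories Mn, Mc and Me all start with "M".
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch   # attach the mark to the current cluster
        else:
            result.append(ch)  # start a new cluster at a base character
    return result


# Alpha with macron and acute, followed by nu: two "characters",
# whether or not the input used the precomposed U+1FB1.
word = "\u1fb1\u0301\u03bd"
print(len(word), len(clusters(word)))
```

Each element of the returned list can then be treated as a single unit by
accentuation or syllabification code, regardless of whether a precomposed
form exists for it.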

James
Received on Mon Feb 08 2016 - 18:00:09 CST