Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 2 Feb 2013 01:51:27 +0000

On Fri, 1 Feb 2013 23:51:34 +0000 (GMT)
Julian Bradfield <jcb+unicode_at_inf.ed.ac.uk> wrote:

> On 2013-02-01, Costello, Roger L. <costello_at_mitre.org> wrote:
> > So why would one ever generate text in decomposed form (NFD)?
>
> Text that I type is quite likely to be in decomposed (or at least not
> composed) form, because I find it a lot easier to have a few
> keystrokes for combining accents than to set up compose key sequences
> for all the possible composed characters.
> For example,
> ǂhèẽ-ǂhèẽ ǃn̥à̰ĩ-ǃn̥à̰ĩ
> was part of the title of a talk. Is there a composed form of à̰? I
> don't know, and don't want to!

> Much easier to do searches and other text processing on it, too.
> (The current dictionary project for this language uses NFD in its data
> files, too.)

But if you use a member of the Keyman family of input methods (I've
been using Keyman for Linux (KMFL)), you can set up a keyboard so you
just enter that using XSAMPA keystrokes, e.g.
=\he_Le~-=\he_Le~ !\n_0a_L_ki~-!\n_0a_L_ki~ and get ǂhèẽ-ǂhèẽ
ǃn̥à̰ĩ-ǃn̥à̰ĩ. The keyboard mapping definition determines whether the
combining grave from ‘_L’ composes. The only problem is that you have
to remember to type a_L_k to get the NFC form à̰ rather than
a_k_L, which delivers the NFD form à̰, but do you not have to remember
the order of diacritics anyway? Simple codepoint-sequence based
searching only works if diacritics are in the correct order.
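
To make the ordering point concrete, here is a minimal Python sketch
using the standard unicodedata module. It assumes that ‘_L’ produces
U+0300 COMBINING GRAVE ACCENT and ‘_k’ produces U+0330 COMBINING TILDE
BELOW; those code points, and the names typed_L_then_k, typed_k_then_L
and codepoints, are my own illustration and not taken from any keyboard
source.

```python
import unicodedata

GRAVE = "\u0300"        # combining grave accent, combining class 230
TILDE_BELOW = "\u0330"  # combining tilde below, combining class 220

# The two keystroke orders, before any composition by the keyboard.
typed_L_then_k = "a" + GRAVE + TILDE_BELOW
typed_k_then_L = "a" + TILDE_BELOW + GRAVE


def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


nfc = unicodedata.normalize("NFC", typed_L_then_k)
nfd = unicodedata.normalize("NFD", typed_L_then_k)

print(codepoints(nfc))  # U+00E0 U+0330
print(codepoints(nfd))  # U+0061 U+0330 U+0300

# The raw sequences differ, so a plain code point comparison fails ...
print(typed_L_then_k == typed_k_then_L)  # False
# ... but both keystroke orders normalize to the same NFD string.
print(unicodedata.normalize("NFD", typed_k_then_L) == nfd)  # True
```

There is no precomposed "a with tilde below", so the NFC form is the
precomposed à followed by the combining tilde below, while NFD puts the
lower combining class (the tilde below) first.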

Having set up an NFC-delivering XSAMPA-based keyboard so that it had
rules O => ɔ, O\ => ʘ, O\\ => O, I’ve found it would be a lot more
useful if I’d been a lot less puristic and set it up so that I had O =>
O, O\ => ɔ, O\\ => ʘ. I use multiple backslashes to get some additional
characters and recover ASCII, an idea I got from Martin Hosken’s IPA
keyboard. I’m currently pondering how to maintain puristic and
‘practical’ versions from the same source files. Ideally I’d also merge
in the related Emacs keyboard definition.

However, as you say, processing is a lot simpler if the text is
guaranteed to be in NFD.
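
As a rough illustration of why, here is a small Python sketch of a
search that normalizes both sides to NFD before comparing. The helper
names nfd and find_nfd are hypothetical, and the sample strings are just
the syllable from the title above, with à and ĩ entered in precomposed
form.

```python
import unicodedata


def nfd(s):
    """Return s in Normalization Form D."""
    return unicodedata.normalize("NFD", s)


def find_nfd(haystack, needle):
    """Find needle in haystack after normalizing both to NFD.

    The returned index refers to the NFD form of haystack, not to the
    original string; -1 means no match.
    """
    return nfd(haystack).find(nfd(needle))


# U+01C3, n, ring below, precomposed a-grave, tilde below, precomposed i-tilde
word = "\u01c3n\u0325\u00e0\u0330\u0129"
# The same vowel typed as a, tilde below, grave
query = "a\u0330\u0300"

print(word.find(query))       # -1: the raw code point sequences differ
print(find_nfd(word, query))  # 3: found once both sides are in NFD
```

If the stored text is already guaranteed to be NFD, only the query needs
normalizing. Note that a plain substring match like this can still match
part of a longer combining sequence, e.g. a bare "a" matches inside "à".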

Richard.
Received on Fri Feb 01 2013 - 19:54:04 CST
