Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 2 Feb 2013 00:35:05 +0000

On Fri, 1 Feb 2013 14:07:19 +0000
"Costello, Roger L." <costello_at_mitre.org> wrote:

> Hi Folks,
>
> The W3C recommends [1] text sent out over the Internet be in
> Normalized Form C (NFC):
>
> This document therefore chooses NFC as the
> base for Web-related early normalization.
>
> So why would one ever generate text in decomposed form (NFD)?

I thought this would be a very good question for a troll to pose, but it
seems I was wrong.

NFD is the simplest form for arbitrary collation and searching. Even
the theory of regular expressions is simpler in NFD. For example,
(<U+0323 COMBINING DOT BELOW, U+0307 COMBINING DOT ABOVE>)* is only a
regular expression (i.e. detectable by a deterministic finite state
machine) if one restricts its interpretation to the strings that
themselves are in NFD, viz the null string and <U+0323, U+0307>.
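
A quick way to check the claim above, as a sketch in Python using the
standard unicodedata module (the helper name is_nfd is hypothetical,
chosen for illustration). Canonical reordering sorts U+0323 (combining
class 220) ahead of U+0307 (combining class 230), so only zero or one
repetition of the pair is itself in NFD:

```python
import unicodedata

PAIR = "\u0323\u0307"  # COMBINING DOT BELOW, COMBINING DOT ABOVE

def is_nfd(s: str) -> bool:
    """True if s is unchanged by NFD normalisation (hypothetical helper)."""
    return unicodedata.normalize("NFD", s) == s

# Report, for 0 to 3 repetitions of the pair, whether the string is in NFD.
for n in range(4):
    print(n, is_nfd(PAIR * n))

# Two repetitions get reordered: the dots below move ahead of the dots above.
reordered = unicodedata.normalize("NFD", PAIR * 2)
print([f"U+{ord(c):04X}" for c in reordered])
```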

NFD can also be easier to use if one is manipulating text where accents
may need to be changed manually. I was recently comparing word lists
and transliterating to a common IPA notation. I was using a
keyboard mapping that generates NFC text if text is entered
correctly. While the forms of the corresponding words from
different lists should have been identical, I frequently had to change
the tone accents (the tones appear to have been recorded inaccurately),
and it was confusing that sometimes deleting one previous character
would delete just a tone mark (exactly what I wanted) and sometimes it
would take the vowel with it. I suppose I could have redefined
deletion of the previous character to take out just one NFD element,
but that would have to be implemented separately for each application.
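
As a rough illustration of "delete one NFD element", here is a minimal
Python sketch. The function name delete_previous_nfd_element is
hypothetical, and the approach (decompose only the final base character
plus its marks, drop the last code point, recompose) is an assumption
about how such a deletion might be written, not a description of any
existing editor:

```python
import unicodedata

def delete_previous_nfd_element(text: str) -> str:
    """Delete the last code point of the NFD form of the final sequence."""
    if not text:
        return text
    # Back up over trailing combining marks to the last starter (class 0).
    i = len(text) - 1
    while i > 0 and unicodedata.combining(text[i]) != 0:
        i -= 1
    head, tail = text[:i], text[i:]
    # Decompose only the final sequence, drop one element, recompose.
    decomposed = unicodedata.normalize("NFD", tail)
    return head + unicodedata.normalize("NFC", decomposed[:-1])

# U+1EBF (e with circumflex and acute) loses only its acute accent.
print(delete_previous_nfd_element("\u1EBF") == "\u00EA")
```

The sketch leaves the rest of the text untouched and ignores edge cases
such as Hangul jamo composing across the split point.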

> Do any programming languages output text in NFD? Does Java? Python?
> C#? Perl? JavaScript?

Do any of these automatically normalise on output?

> Do any tools produce text in NFD?

Surely many editors will produce text as it is entered. One should
also note that some Unicode-defined processes, such as capitalisation,
do not preserve canonical equivalence.

My favourite example of such a breakdown is canonically equivalent NFC
<U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0359 COMBINING
ASTERISK BELOW> and NFD <U+03B1 GREEK SMALL LETTER ALPHA, U+0359,
U+0345 COMBINING GREEK YPOGEGRAMMENI>, which capitalise to the
inequivalent <U+0391 GREEK CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL
LETTER IOTA, U+0359> and <U+0391, U+0359, U+0399>. (U+0359 COMBINING
ASTERISK BELOW was added to Unicode on the basis of its use in citing
damaged Greek text.) TUS 6.2 Section 5.19 still contains the untruth,
'Casing operations as defined in Section 3.13, Default Case Algorithms,
preserve canonical equivalence, but are not guaranteed to preserve
Normalization Forms.' I'm not sure that there is any point in formally
reporting it, but I've just done so because I can't remember whether I
ever formally reported this issue.
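
The example above can be tried out with a short Python sketch. It
assumes that Python's str.upper() applies the default Unicode uppercase
mappings, and the exact output depends on the Unicode data shipped with
the interpreter; canonically_equivalent is a hypothetical helper that
compares NFD forms:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their NFD forms (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

nfc = "\u1FB3\u0359"        # ALPHA WITH YPOGEGRAMMENI, ASTERISK BELOW
nfd = "\u03B1\u0359\u0345"  # ALPHA, ASTERISK BELOW, YPOGEGRAMMENI

# Equivalent before case conversion...
print(canonically_equivalent(nfc, nfd))
# ...but not after it.
print(canonically_equivalent(nfc.upper(), nfd.upper()))
```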

One should also note that unnormalised text may be produced even in
cases where NFC and NFD are the same. For example, in the Tai Tham
script the Thai word _keng_ 'clever' (a loan into Northern Thai) is, in
NFC, ᨠᩮ᩠᩵ᨦ <U+1A20 TAI THAM LETTER HIGH KA, U+1A6E TAI THAM VOWEL SIGN
E, U+1A60 TAI THAM SIGN SAKOT, U+1A75 TAI THAM SIGN TONE-1, U+1A26 TAI
THAM LETTER NGA>, but I would expect it to be entered graphical element
by element as ᨠᩮ᩠᩵ᨦ ka <U+1A20>, mai kee <U+1A6E>, mai yo <U+1A75>,
haang nga <U+1A60, U+1A26>. (Don't worry if the Tai Tham script
doesn't render - the point is that haang nga is the subscript form of
NGA.) In this case, the canonical combining class of U+1A60 was changed
from 0 to 9 between ISO approval and publishing in Unicode, and the
change wasn't spotted until too late.
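
A small Python sketch of the Tai Tham case, assuming an interpreter
whose Unicode data includes Tai Tham (Unicode 5.2 or later). It
compares the order in which the word would be typed with the
normalised order; the variable names are made up for illustration:

```python
import unicodedata

# KA, VOWEL SIGN E, SAKOT, TONE-1, NGA: the normalised order.
normalised = "\u1A20\u1A6E\u1A60\u1A75\u1A26"
# KA, VOWEL SIGN E, TONE-1, SAKOT, NGA: the order of entry by graphical element.
as_typed = "\u1A20\u1A6E\u1A75\u1A60\u1A26"

# NFC and NFD agree here, and both differ from the typed order.
for form in ("NFC", "NFD"):
    print(form, unicodedata.normalize(form, as_typed) == normalised)

# Combining classes of SAKOT and TONE-1, which drive the reordering.
print(unicodedata.combining("\u1A60"), unicodedata.combining("\u1A75"))
```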

There may be more compelling examples of unnormalised text for Hebrew -
I'm not au fait with the canonical ordering of Hebrew.

> Should I assume that any text my applications receive will always be
> normalized to NFC form?

NO!!
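
In practice that means normalising whatever arrives rather than
trusting it. A minimal Python sketch, assuming the application wants
NFC internally (the function name to_nfc and the sample string are
illustrative only):

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Normalise received text to NFC instead of assuming it already is."""
    return unicodedata.normalize("NFC", text)

received = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(to_nfc(received) == "caf\u00E9")
```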

Richard.
Received on Fri Feb 01 2013 - 18:38:29 CST