From: Roozbeh Pournader (email@example.com)
Date: Tue May 24 2011 - 20:20:30 CDT
I agree with you that problems exist, but I don't agree that
applications that support Unicode have a license to treat canonically
equivalent strings in different ways, especially when displaying them.
My point in writing the email was that the software "out there" still
needs to catch up to some parts of Unicode, and if you may be working
with that software, make sure you know its shortcomings before you
assume that it's a perfect world out there and convert everything to
NFC. Not that it's better for software to expect strings in the
"logical" order that some users expect and to fail in other cases,
especially when the string is in NFC.
And I'm not saying this for philosophical reasons or as a standards
purist. I am writing and using various pieces of software every day that
would be much much simpler if I simply could assume that commonly
available software could show normalized strings the right way, instead
of expecting them to be in a script-specific software-specific
normalized order that I would need to figure out for each piece of
software I can't change.
In this world, several of my "scripts" need to support different tastes
of normalization and string comparison. In the future world where at
least everyone can handle NFC strings, I can normalize early in the game
and do much simpler comparisons everywhere else.
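As a minimal sketch of the "simpler comparison" case, assuming Python and its standard unicodedata module: the helper name canonically_equal is hypothetical, and the sample strings are the Arabic sequence discussed further down in this thread. The stored string is left untouched; only temporary copies are normalized for the comparison.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical equivalence.

    Hypothetical helper: both arguments are normalized to NFC only for
    the comparison; neither input string is modified.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The same Arabic text in two different mark orders.
stored = "\u0633\u0651\u064E"    # SEEN, SHADDA, FATHA (order as typed)
incoming = "\u0633\u064E\u0651"  # SEEN, FATHA, SHADDA (canonical order)

assert stored != incoming                    # code-point comparison differs
assert canonically_equal(stored, incoming)   # canonically equivalent
```

Software that cannot yet rely on every consumer handling NFC could keep `stored` as it arrived and call a helper like this wherever two strings need to be matched.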
PS: For the record, I don't expect Microsoft to go and fix Windows XP
and Word 2003 this late in the process. This is a reality for people
like me today. I'm just hoping that it will be a smaller problem for
people who are writing similar code a few years from now when some of
these older applications go out of usage. (Or possibly, I'm just
On Wed, 2011-05-25 at 00:27 +0000, Peter Constable wrote:
> Uniscribe normalization is reasonably robust for Latin, Greek and
> Cyrillic. But it’s simply a fact that NFC normalization can have
> undesirable effects on various other scripts. In particular, the
> canonical ordering algorithm used in Unicode normalization can be a
> problem for various scripts. For example, in Biblical Hebrew, marks
> will get re-ordered into a sequence that is decidedly not what makes
> sense for users: the set of general classes (>= 200) and fixed-position
> classes (< 200) used for Hebrew leads to that result. There are issues
> for other scripts as well.
> These are issues inherent to normalization itself, regardless of the
> software in use. In those cases, Roozbeh’s point applies: emitting NFC
> “into the wild” can be as much of a problem as emitting NFD.
> The only places where Unicode normalization is totally safe are those
> places for which it was created: not transforming data that will get
> persisted or transmitted to other users and processes, but in internal
> processing for comparing strings for the kinds of equivalences that
> Unicode normalization defines.
> From: firstname.lastname@example.org [mailto:email@example.com]
> On Behalf Of Roozbeh Pournader
> Sent: Tuesday, May 24, 2011 4:28 PM
> To: Phillips, Addison
> Cc: Christoph Päper; Unicode Discussion
> Subject: RE: Slots for Cyrillic Accented Vowels
> On Mon, 2011-05-23 at 08:17 -0700, Phillips, Addison wrote:
> [...] you generally should not emit NFD "into the wild"
> In the real world, of course, you should actually not emit NFC either.
> A famous case that comes to bite me again and again, is that some
> XP-era Microsoft applications don't render canonically equivalent
> strings the same way, so if you normalize something, you lose its
> preferred display and semantics. For example, the sequence <ARABIC
> LETTER SEEN, ARABIC SHADDA, ARABIC FATHA>, which is a very normal and
> rather common sequence in Arabic, becomes <SEEN, FATHA, SHADDA> if one
> actually normalizes it (to either NFC or NFD), and that normalized
> sequence is displayed wrongly by Windows XP's Uniscribe in both
> Notepad and Word 2003.
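The reordering discussed in this thread can be checked with Python's standard unicodedata module. This is only a sketch: the helper name describe is hypothetical, and the Hebrew sample (BET, DAGESH, PATAH) is an illustrative assumption rather than an example taken from the messages above. It prints each code point with its canonical combining class before and after NFC, showing that marks are sorted by class value.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, int]]:
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# SEEN, SHADDA (class 33), FATHA (class 30): the order from the message.
arabic = "\u0633\u0651\u064E"
# BET, DAGESH (class 21), PATAH (class 17): assumed Hebrew illustration.
hebrew = "\u05D1\u05BC\u05B7"

for label, text in (("Arabic", arabic), ("Hebrew", hebrew)):
    nfc = unicodedata.normalize("NFC", text)
    print(label, "before:", describe(text))
    print(label, "after: ", describe(nfc))

# Canonical ordering sorts adjacent marks by ascending combining class,
# so FATHA moves ahead of SHADDA and PATAH moves ahead of DAGESH.
assert unicodedata.normalize("NFC", arabic) == "\u0633\u064E\u0651"
assert unicodedata.normalize("NFC", hebrew) == "\u05D1\u05B7\u05BC"
```

Whether the reordered sequence then renders correctly depends on the rendering engine, which this sketch does not test.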