From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Fri Mar 04 2005 - 10:00:43 CST
On Fri, 4 Mar 2005 16:32:12 +0100, "Marcin 'Qrczak' Kowalczyk" wrote:
>
> > Unicode Standard Annex #15 (http://www.unicode.org/reports/tr15/)
> > specifies that precomposed characters that are added after Unicode
> > 3.0 are excluded from composition (i.e. not recomposed when NFC is
> > applied to them). As all characters beyond the BMP were added in
> > Unicode 3.1 or later, you can effectively ignore any character
> > greater than U+FFFF (or any surrogates if you are processing UTF-16)
> > when applying NFC to a text stream.
>
> The last sentence is not true: precomposed characters above U+FFFF
> must be *decomposed* by NF*C*.
Yes, of course you're right. The first thing you do with NFC is to apply
canonical decomposition, so indeed you can't ignore supra-BMP characters in case
there are any precomposed musical symbols or CJK compatibility ideographs. You
also need to apply canonical reordering between the decomposition and
composition stages, and there are 30 characters in the SMP that have a canonical
combining class greater than 0, which therefore cannot be ignored either. The only
part of the process of applying NFC in which you can safely ignore supra-BMP
characters is the final stage, when you do the composition.
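
To illustrate the point, here is a minimal sketch using Python's standard
unicodedata module. The helper names (nfc, hexes) and the sample strings are my
own choices for illustration, and the results depend on the Unicode data
bundled with the Python build, so treat it as a quick check rather than a
reference implementation.

```python
import unicodedata


def nfc(text):
    """Apply NFC using the Unicode data shipped with this Python build."""
    return unicodedata.normalize("NFC", text)


def hexes(text):
    """Show a string as a list of U+XXXX code points for easy inspection."""
    return ["U+%04X" % ord(ch) for ch in text]


# A precomposed musical symbol in the SMP (MUSICAL SYMBOL HALF NOTE).
# It has a canonical decomposition and is excluded from composition,
# so NFC should leave it decomposed.
half_note = "\U0001D15E"

# A CJK compatibility ideograph in the SIP; its canonical decomposition
# is a single BMP ideograph.
cjk_compat = "\U0002F800"

# A void notehead followed by two SMP combining marks in non-canonical
# order (augmentation dot before stem). Canonical reordering should sort
# the marks by combining class.
unordered = "\U0001D157\U0001D16D\U0001D165"

for label, sample in [("half note", half_note),
                      ("CJK compat", cjk_compat),
                      ("unordered marks", unordered)]:
    print(label, hexes(sample), "->", hexes(nfc(sample)))

# Combining classes of the two SMP marks used above (both are non-zero).
print(unicodedata.combining("\U0001D165"), unicodedata.combining("\U0001D16D"))
```

In each case the NFC output differs from the input even though every character
involved is above U+FFFF, which is why only the composition stage can skip
them.
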
Andrew