Re: Small Java implementation of NFC

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Fri Mar 04 2005 - 10:00:43 CST


    On Fri, 4 Mar 2005 16:32:12 +0100, "Marcin 'Qrczak' Kowalczyk" wrote:
    >
    > > Unicode Standard Annex #15 (http://www.unicode.org/reports/tr15/)
    > > specifies that precomposed characters that are added after Unicode
    > > 3.0 are excluded from composition (i.e. not recomposed when NFC is
    > > applied to them). As all characters beyond the BMP were added in
    > > Unicode 3.1 or later, you can effectively ignore any character
    > > greater than U+FFFF (or any surrogates if you are processing UTF-16)
    > > when applying NFC to a text stream.
    >
    > The last sentence is not true: precomposed characters above U+FFFF
    > must be *decomposed* by NF*C*.

    Yes, of course you're right. The first step of NFC is canonical
    decomposition, so you can't ignore supra-BMP characters in case there are
    any precomposed musical symbols or CJK compatibility ideographs. You also
    need to apply canonical reordering between the decomposition and composition
    stages, and there are 30 characters in the SMP that have a canonical
    combining class greater than 0, so those cannot be ignored either. The only
    part of NFC in which you can safely ignore supra-BMP characters is the final
    composition stage.
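
To make the point concrete, here is a minimal sketch using the JDK's own java.text.Normalizer (Java 8 or later), not the small implementation under discussion. The class name SupraBmpNfc and the helpers nfc and hex are made up for illustration. It shows a precomposed musical symbol and a CJK compatibility ideograph being decomposed by NFC, and two SMP combining marks being reordered.

```java
import java.text.Normalizer;

public class SupraBmpNfc {
    /** Applies NFC to a sequence of code points and returns the resulting code points. */
    static int[] nfc(int... codePoints) {
        String input = new String(codePoints, 0, codePoints.length);
        return Normalizer.normalize(input, Normalizer.Form.NFC).codePoints().toArray();
    }

    /** Formats code points as "U+XXXX U+XXXX ..." for display. */
    static String hex(int[] codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int cp : codePoints) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D15E MUSICAL SYMBOL HALF NOTE has a canonical decomposition and is
        // excluded from composition, so NFC leaves it decomposed.
        System.out.println(hex(nfc(0x1D15E)));                   // U+1D157 U+1D165

        // U+2F800 is a CJK compatibility ideograph with a singleton decomposition.
        System.out.println(hex(nfc(0x2F800)));                   // U+4E3D

        // U+1D17B (combining class 220) followed by U+1D165 (combining class 216):
        // canonical reordering swaps the two marks.
        System.out.println(hex(nfc(0x1D157, 0x1D17B, 0x1D165))); // U+1D157 U+1D165 U+1D17B
    }
}
```

In all three cases the output differs from the input, so an implementation that passed supra-BMP code points through untouched would not produce NFC.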

    Andrew


