Re: NFD on u+AC00 contradicts NormalisationData.txt ?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jun 14 2006 - 17:00:09 CDT


    From: "Theodore H. Smith" <delete@elfdata.com>
    > I'm testing some NFD code of mine. It does not do NFKD yet.
    >
    > This line from UnicodeData.txt
    >
    > AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
    >
    > It says it has no decomposition.

    Wrong.

    The main UCD file does not contain a line for every Hangul syllable. As TUS explains, Hangul is handled specially: every character in this block has an algorithmic decomposition rather than one listed in the data file.
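As an illustration of the algorithmic decomposition mentioned above, here is a minimal Python sketch using the Hangul constants published in TUS (the conjoining jamo section). The function name `decompose_hangul` is my own choice for this example and is not taken from the standard or from the original poster's code.

```python
# Constants for Hangul syllable decomposition, as given in TUS.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_hangul(cp):
    """Return the canonical decomposition of code point cp as a list.

    Code points outside the Hangul syllable range are returned unchanged.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    # A trailing index of 0 means the syllable has no trailing consonant.
    if trail == T_BASE:
        return [lead, vowel]
    return [lead, vowel, trail]
```

With this sketch, `decompose_hangul(0xAC00)` gives `[0x1100, 0x1161]`, so U+AC00 does have a canonical decomposition even though its line in the data file shows none.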

    Any line in the UCD whose name field contains ", First>" or ", Last>" marks one end of a range of codepoints that the algorithms treat specially. This behavior is described in TUS, and the TUS specification overrides the default properties displayed for that range of codepoints.

    This means that the file is partially compressed: it contains far fewer lines than the number of codepoints assigned to characters in the standard.
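The following Python sketch shows one way a parser could treat the ", First>" and ", Last>" lines as the two ends of a range instead of as two isolated characters. The function name `expand_ranges` and the `SAMPLE` string are assumptions made for this example; the Hangul lines in the sample are the ones quoted in this thread and the matching end-of-range line.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""


def expand_ranges(lines):
    """Yield (first_cp, last_cp, fields) for each entry in the data.

    An ordinary line yields a one-codepoint range. A ", First>" line is
    held back until its ", Last>" partner arrives, and the pair is then
    yielded as a single range carrying the fields of the First line.
    """
    pending = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name = fields[1]
        if name.endswith(", First>"):
            pending = (cp, fields)
        elif name.endswith(", Last>"):
            if pending is None:
                raise ValueError("Last> line without a First> line")
            first_cp, first_fields = pending
            pending = None
            yield first_cp, cp, first_fields
        else:
            yield cp, cp, fields
```

Applied to `SAMPLE.splitlines()`, the two Hangul lines collapse into one range from 0xAC00 to 0xD7A3, which covers 11172 codepoints. The properties of the codepoints inside such a range still have to come from the rules in TUS, not from the two boundary lines alone.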

    Also, default property values are defined for unassigned codepoints within specific blocks. Read TUS about them (notably the BiDi algorithm).

    And don't forget the additional composition exclusion file (CompositionExclusions.txt), which lists characters that do have canonical decompositions in the main UCD file but that must NOT be recomposed when converting to NFC or NFKC. Characters with singleton canonical decompositions are excluded from recomposition as well, but they are NOT listed in that file, because you can determine them algorithmically when parsing the main UCD file.
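To show how singleton canonical decompositions can be detected while parsing, here is a small Python sketch. It assumes the decomposition mapping is the sixth semicolon-separated field of each line and that compatibility mappings start with a tag in angle brackets. The names `is_singleton` and `find_singletons` are hypothetical, and the sample lines are written in the style of the main UCD file for illustration; only their first and sixth fields matter here.

```python
SAMPLE = """\
00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;;;;00E5;
212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;
"""


def is_singleton(decomposition):
    """True if the field holds a canonical mapping to exactly one codepoint."""
    if not decomposition or decomposition.startswith("<"):
        # Empty field: no decomposition. Leading tag: compatibility mapping.
        return False
    return len(decomposition.split()) == 1


def find_singletons(lines):
    """Return the codepoints whose canonical decomposition is a singleton."""
    result = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) > 5 and is_singleton(fields[5]):
            result.append(int(fields[0], 16))
    return result
```

On the sample, only U+212B is reported: it maps canonically to the single codepoint U+00C5, so NFC never produces U+212B again. A full recomposition table would need the union of these singletons with the entries from the exclusion file, plus any other exclusions the normalization specification requires.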


