some light on a few points

From: spir (
Date: Wed Dec 22 2010 - 11:21:47 CST



    -1- code validity
    I have long thought that only values corresponding to surrogates were invalid codes. But I recently discovered, both in D's builtin unicode-aware chars & strings and on the site 'fileformat', that some other codes are invalid, like U+FFFE & U+FFFF.
    I'm a bit lost. What's true, then? And where can I find actual and *clear* definitions of code validity?
    I also discovered in the ICU docs that ICU does not reject unpaired surrogates. D, on the other hand, rejects unpaired surrogates and more. ???
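
    To make the question concrete, here is a minimal Python sketch of the categories I am trying to tell apart. The function name classify and its labels are my own, purely illustrative. It assumes the usual ranges: surrogates are U+D800..U+DFFF, and noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane. The last two lines show what one implementation (Python 3's UTF-8 codec) happens to accept or reject; other libraries may well choose differently.

```python
def classify(cp: int) -> str:
    """Return a rough category for an integer code point (labels are illustrative)."""
    if not 0 <= cp <= 0x10FFFF:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "ordinary code point"


def utf8_accepts(ch: str) -> bool:
    """Check whether Python's strict UTF-8 encoder accepts this one-character string."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x41, 0xD800, 0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF, 0x110000):
    print(f"U+{cp:04X}: {classify(cp)}")

print("lone surrogate encodable:", utf8_accepts("\ud800"))
print("U+FFFE encodable:", utf8_accepts("\ufffe"))
```

    In this sketch the lone surrogate is refused by the encoder while U+FFFE goes through, which is exactly the kind of difference between implementations that confuses me.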

    -2- grapheme meaningfulness
    I take the opportunity to ask about grapheme (in the unicode sense *) validity as well: the "grapheme cluster boundary" algorithm seems to quietly allow building meaningless "graphemes" such as base-less (sequences of) combining codes. What are we expected to do with them?
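
    As a rough illustration of what I mean by "base-less", here is a small Python sketch. The helper lacks_base is hypothetical, and it only looks at the general category of the first code point (Mn, Mc, Me); it does not implement the grapheme cluster boundary algorithm itself, for which Python's standard library has no support as far as I know.

```python
import unicodedata


def lacks_base(cluster: str) -> bool:
    """True if the string is non-empty and starts with a combining mark (category M*)."""
    return bool(cluster) and unicodedata.category(cluster[0]).startswith("M")


samples = {
    "A + acute": "A\u0301",
    "acute alone": "\u0301",
    "acute + tilde": "\u0301\u0303",
}
for label, text in samples.items():
    print(f"{label}: lacks_base = {lacks_base(text)}")
```

    A check like this can flag such sequences, but it does not tell me whether they should be rejected, repaired, or passed through untouched.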

    -3- _unique_ ordering
    The "canonical" ordering algorithm does not provide a unique representation: codes with the same combining class (ccc) are not ordered. For instance, most (all?) diacritics placed above have the same class (230). Thus, <A><dot above><tilde> and <A><tilde><dot above> can both be the output of ordering, while they represent the same piece of text.
    I thought the core point of normalisation was precisely to provide a _unique_ form for each text --so that, for instance, one can safely and efficiently search/count/replace... But if I search the first form in a text that holds the second, I'll miss it.
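
    The case can be reproduced with Python's unicodedata module, as a small experiment rather than a statement about what the standard intends. The variable names are mine. The third string mixes classes 230 and 220, to contrast with the two same-class sequences.

```python
import unicodedata

A_DOT_TILDE = "A\u0307\u0303"   # A, COMBINING DOT ABOVE (ccc 230), COMBINING TILDE (ccc 230)
A_TILDE_DOT = "A\u0303\u0307"   # same marks, opposite order
A_ABOVE_BELOW = "A\u0307\u0323" # DOT ABOVE (ccc 230) followed by DOT BELOW (ccc 220)


def nfd(text: str) -> str:
    """Canonical decomposition followed by canonical ordering."""
    return unicodedata.normalize("NFD", text)


def show(text: str) -> str:
    """Render a string as code points with their combining classes."""
    return " ".join(f"U+{ord(c):04X}(ccc={unicodedata.combining(c)})" for c in text)


for name, text in [("dot, tilde", A_DOT_TILDE),
                   ("tilde, dot", A_TILDE_DOT),
                   ("above, below", A_ABOVE_BELOW)]:
    print(f"{name}: {show(text)}  ->  {show(nfd(text))}")

print("same after NFD:", nfd(A_DOT_TILDE) == nfd(A_TILDE_DOT))
```

    With the data shipped in Python, the two same-class sequences come back unchanged and still compare unequal, whereas the marks with different classes are swapped into class order. So a search for one form does miss the other, which is the behaviour I am asking about.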


    (*) I mean here "grapheme" not in the common sense of graphical form of a phoneme, but in the Unicode sense of character in the common sense ;-)
    -- -- -- -- -- -- --
    vit esse estrany ☣

    This archive was generated by hypermail 2.1.5 : Wed Dec 22 2010 - 11:34:09 CST