some light a few points

From: spir (denis.spir@gmail.com)
Date: Wed Dec 22 2010 - 11:21:47 CST

Next message: Lisa Moore: "Re: Unicode for Tamil (Thamizh) - From a Pondicherrian in France"

Previous message: Doug Ewell: "Re: coloured characters"
Next in thread: Mark Davis ☕: "Re: some light a few points"
Reply: Mark Davis ☕: "Re: some light a few points"
Reply: Asmus Freytag: "Re: some light a few points"
Reply: Christoph Päper: "Re: some light a few points"

Hello,

-1- code validity
I have long thought that only the values corresponding to surrogates were invalid codes. But I recently discovered, both in D's builtin Unicode-aware chars & strings and on the 'fileformat' site (http://www.fileformat.info/info/unicode/char/ffff/index.htm), that some other codes are invalid as well, like U+FFFE and U+FFFF.
I'm a bit lost. What's true, then? And where can I find actual and *clear* definitions of code validity?
I also discovered in the ICU docs that ICU does not reject unpaired surrogates (http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings). D, by contrast, rejects unpaired surrogates and more. ???
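
To make the two kinds of code concrete, here is a minimal Python sketch. It assumes the usual distinction: surrogates are U+D800..U+DFFF, and noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane (U+FFFE and U+FFFF among them). The helper names `classify` and `can_encode_utf8` are made up for this illustration, and what counts as "invalid" in a given library remains that library's choice.

```python
import unicodedata


def classify(cp: int) -> str:
    """Roughly classify a code point (hypothetical helper, not a library API)."""
    if not 0 <= cp <= 0x10FFFF:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    return "other"


def can_encode_utf8(cp: int) -> bool:
    """Report whether Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x0041, 0xD800, 0xFFFE, 0xFFFF):
    print(
        f"U+{cp:04X}",
        classify(cp),
        unicodedata.category(chr(cp)),  # 'Cs' for surrogates, 'Cn' for noncharacters
        can_encode_utf8(cp),
    )
```

In this sketch, Python's strict UTF-8 encoder refuses the lone surrogate but accepts U+FFFE and U+FFFF, which shows that libraries draw the line in different places.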

-2- grapheme meaningfulness
I take the opportunity to ask about grapheme (in the Unicode sense *) validity as well: the "grapheme cluster boundary" algorithm seems to quietly allow building meaningless "graphemes", such as base-less (sequences of) combining codes. What are we expected to do with them?
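
As an illustration of the base-less case, the sketch below detects a text that starts with a combining mark and, as one possible handling among several, prepends a placeholder base. The helper names are hypothetical, the choice of U+25CC DOTTED CIRCLE as placeholder is an assumption made for this example, and only the start of the text is checked (a mark following a line break or control code would need a fuller segmentation).

```python
import unicodedata

# Placeholder base used for this example only; the choice is an assumption.
DOTTED_CIRCLE = "\u25cc"


def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me (combining marks)."""
    return unicodedata.category(ch).startswith("M")


def starts_without_base(text: str) -> bool:
    """True if the text begins with a combining mark, i.e. has no base."""
    return bool(text) and is_mark(text[0])


def with_placeholder_base(text: str) -> str:
    """Prepend a visible placeholder base when the text has none."""
    if starts_without_base(text):
        return DOTTED_CIRCLE + text
    return text


baseless = "\u0301\u0303"   # acute + tilde, no base
normal = "e\u0301"          # e + acute
print(starts_without_base(baseless), starts_without_base(normal))
print(with_placeholder_base(baseless).encode("unicode_escape"))
```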

-3- _unique_ ordering
The "canonical" ordering algorithm does not provide a unique representation: codes with the same combining class (ccc) are not reordered relative to each other. For instance, most (all?) diacritics placed above have the same class (230). Thus, <A><dot above><tilde> and <A><tilde><dot above> can both be the output of the ordering, while they represent the same piece of text.
I thought the core point of normalisation was precisely to provide a _unique_ form for each text, so that, for instance, one can safely and efficiently search/count/replace... But if I search for the first form in a text that holds the second, I'll miss it.
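
The behaviour described above can be checked with Python's standard `unicodedata` module. This is a small sketch, not taken from any Unicode document. It compares the NFD forms of two sequences whose marks share class 230, then of two sequences whose marks have different classes (dot below is 220). The variable names and the helper `nfd` are made up for the example.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, ccc 230
TILDE = "\u0303"      # COMBINING TILDE, ccc 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc 220


def nfd(text: str) -> str:
    """Canonical decomposition followed by canonical ordering."""
    return unicodedata.normalize("NFD", text)


# Same combining class: the relative order of the marks is left untouched.
s1 = "A" + DOT_ABOVE + TILDE
s2 = "A" + TILDE + DOT_ABOVE

# Different combining classes: the marks are sorted by class.
s3 = "A" + DOT_ABOVE + DOT_BELOW
s4 = "A" + DOT_BELOW + DOT_ABOVE

for mark in (DOT_ABOVE, TILDE, DOT_BELOW):
    print(unicodedata.name(mark), unicodedata.combining(mark))

print("same class, equal after NFD:     ", nfd(s1) == nfd(s2))
print("different class, equal after NFD:", nfd(s3) == nfd(s4))
```

With this data, the two same-class sequences stay distinct after normalisation, whereas the two different-class sequences end up in one and the same order.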

Denis

(*) I mean here "grapheme" not in the common sense of graphical form of a phoneme, but in the Unicode sense of character in the common sense ;-)
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com

This archive was generated by hypermail 2.1.5 : Wed Dec 22 2010 - 11:34:09 CST