From: Mark Davis ☕ (email@example.com)
Date: Wed Dec 22 2010 - 15:58:19 CST
*— Il meglio è l’inimico del bene —*
On Wed, Dec 22, 2010 at 09:21, spir <firstname.lastname@example.org> wrote:
> -1- code validity
> I have long thought only values corresponding to surrogates were invalid
> codes. But I recently discovered both in D's builtin unicode-aware chars &
> strings, and on the site 'fileformat' (
> http://www.fileformat.info/info/unicode/char/ffff/index.htm) that some
> other codes are invalid, like fffe & ffff.
> I'm a bit lost. What's true, then? And where can I find actual and *clear*
> definitions of code validity?
See Chapter 3 in http://www.unicode.org/versions/Unicode6.0.0/, especially
"well-formed" vs "ill-formed"
> I also discovered in ICU docs that it does not reject unpaired surrogates (
> D instead rejects unpaired surrogates and more. ???
ICU treats unpaired surrogates as if they were unassigned characters, when
manipulating them as strings. That is a common technique; see the discussions of
"UnicodeString" in Chapter 3.
> -2- grapheme meaningfulness
> I take the opportunity to ask about grapheme (in the unicode sense *)
> validity as well: the "grapheme cluster boundary" algorithm seems to quietly
> allow building meaningless "graphemes" such as base-less (sequences of)
> combining codes. What are we expected to do with them?
It depends on what you are trying to do. You can filter out degenerate cases
or keep them. For more information, see http://unicode.org/reports/tr29/
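
If you do decide to filter, one possible approach is sketched below in Python. It is an assumption-laden illustration, not part of UAX #29. It only looks at the start of a string, treats every code point of general category M* as a combining mark, and ignores marks that follow controls or line breaks, which are also degenerate. All function names are made up for this example.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me (combining marks)."""
    return unicodedata.category(ch).startswith("M")


def has_baseless_marks(s: str) -> bool:
    """True if s begins with a combining mark, i.e. a cluster with no base."""
    return bool(s) and is_mark(s[0])


def strip_leading_marks(s: str) -> str:
    """One possible filter: drop leading combining marks that have no base."""
    i = 0
    while i < len(s) and is_mark(s[i]):
        i += 1
    return s[i:]
```

Keeping the degenerate clusters is just as legitimate. An editor, for instance, has to display and delete them, so it would call only `has_baseless_marks` and leave the text untouched.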
> -3- _unique_ ordering
> The "canonical" ordering algorithm does not provide a unique
> representation: codes with the same ordering class (ccc) are not ordered.
> For instance, most (all?) diacritics placed above have the same class (230).
> Thus, <A><dot above><tilde> and <A><tilde><dot above> can both be output of
> ordering, while they represent the same piece of text.
That is incorrect. These *do not* represent the same text.
There are some cases, especially with combining characters with ccc=0 where
the canonical ordering is not sufficient. Moreover, in general the
normalization algorithms do not and cannot always give a unique output for
"the same text", since that phrase is so vague. "A" and "a" are the same
word in English, but are not merged by normalization; moreover, it may vary
by language: "aa" and "å" in Danish.
So you have to be much more precise about which sense of "the same" you
are looking for.
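
A small check with Python's standard `unicodedata` module may help here; it is a sketch, and `canonically_equal` is a hypothetical helper. Marks with different combining classes do not interact, so normalization puts them in a fixed order. Marks with the same class do interact, so their order is significant and is left alone.

```python
import unicodedata

DOT_ABOVE, TILDE, DOT_BELOW = "\u0307", "\u0303", "\u0323"  # ccc 230, 230, 220


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Same class (230, 230): order is kept, so the two strings stay distinct.
same_class = canonically_equal("A" + DOT_ABOVE + TILDE, "A" + TILDE + DOT_ABOVE)

# Different classes (230, 220): both orders normalize to one sequence.
diff_class = canonically_equal("a" + DOT_ABOVE + DOT_BELOW, "a" + DOT_BELOW + DOT_ABOVE)

# Case is a different sense of "the same" that normalization does not address.
case_pair = canonically_equal("A", "a")
```

Here `same_class` and `case_pair` are `False`, and `diff_class` is `True`.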
> I thought the core point of normalisation was precisely to provide a
> _unique_ form for each text --so that, for instance, one can safely and
> efficiently search/count/replace... But if I search the first form in a text
> that holds the second, I'll miss it.
What may help is for you to look at the UCA, in the section on matching.
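
Short of the UCA, which Python's standard library does not implement, the search case in the quoted text can be approximated by normalizing both strings before comparing. The sketch below is a crude stand-in under that assumption, with a hypothetical function name. It matches at the code point level, so a needle can still match part of a combining sequence (searching for "cafe" finds "café"). The matching described in the UCA deals with that and with language-specific equivalences.

```python
import unicodedata


def nfd(s: str) -> str:
    """Normalize to canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", s)


def contains_canonical(haystack: str, needle: str) -> bool:
    """Substring test on NFD forms, so canonically equivalent spellings match."""
    return nfd(needle) in nfd(haystack)


decomposed = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"    # the single code point U+00E9

raw_hit = precomposed in decomposed                      # plain search misses
norm_hit = contains_canonical(decomposed, precomposed)   # normalized search finds it
```

Sequences that are not canonically equivalent, such as the dot-above and tilde in either order, are still not matched by this.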
> (*) I mean here "grapheme" not in the common sense of graphical form of a
> phoneme, but in the Unicode sense of character in the common sense ;-)
> -- -- -- -- -- -- --
> vit esse estrany ☣