Re: some light a few points

From: Mark Davis ☕ (mark@macchiato.com)
Date: Wed Dec 22 2010 - 15:58:19 CST


    Mark

    *— Il meglio è l’inimico del bene —*

    On Wed, Dec 22, 2010 at 09:21, spir <denis.spir@gmail.com> wrote:

    > Hello,
    >
    >
    > -1- code validity
    > I have long thought only values corresponding to surrogates were invalid
    > codes. But I recently discovered both in D's builtin unicode-aware chars &
    > strings, and on the site 'fileformat' (
    > http://www.fileformat.info/info/unicode/char/ffff/index.htm) that some
    > other codes are invalid, like fffe & ffff.
    > I'm a bit lost. What's true, then? And where can I find actual and *clear*
    > definitions of code validity?
    >

    See Chapter 3 in http://www.unicode.org/versions/Unicode6.0.0/, especially
    the definitions of "well-formed" vs. "ill-formed" code unit sequences.
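
    To make the distinction concrete, here is a rough sketch in Python (my
    choice for illustration; it is not mentioned in this thread, and the
    helper names are made up). Surrogate code points are not Unicode scalar
    values, so they cannot appear in well-formed UTF-8. Noncharacters such as
    U+FFFE and U+FFFF are scalar values and can be encoded, but they are
    reserved for internal use rather than open interchange, which is
    presumably why some libraries flag them.

```python
def is_surrogate(code_point: int) -> bool:
    # Surrogate code points: U+D800..U+DFFF (not Unicode scalar values).
    return 0xD800 <= code_point <= 0xDFFF


def is_noncharacter(code_point: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # of every plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def utf8_or_none(code_point: int):
    # Hypothetical helper: the UTF-8 bytes for a code point, or None when
    # Python's encoder refuses it (which it does for surrogates).
    try:
        return chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return None


for cp in (0xD800, 0xFFFE, 0xFFFF, 0x0041):
    print(f"U+{cp:04X}", is_surrogate(cp), is_noncharacter(cp), utf8_or_none(cp))
```

    Whether to reject noncharacters on input is a policy decision of the
    library; the encoding forms themselves only exclude surrogates.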

    > I also discovered in ICU docs that it does not reject unpaired surrogates (
    > http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings).
    > D instead rejects unpaired surrogates and more. ???
    >

    ICU treats unpaired surrogates as if they were unassigned characters when
    manipulating them in strings. That is a common technique; see the
    discussion of "Unicode strings" in Chapter 3.

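    As an illustration only (Python, with a hypothetical function name; this
    is not how ICU is implemented), a 16-bit string can be scanned for
    unpaired surrogates without being rejected. A process that wants strict
    UTF-16 can then decide what to do with the positions it finds.

```python
def unpaired_surrogates(units):
    """Return the indices of 16-bit code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: well-formed only if a trail surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead.
            bad.append(i)
        i += 1
    return bad


well_formed = [0x0041, 0xD83D, 0xDE00]   # "A" followed by a surrogate pair
lenient_only = [0xD800, 0x0041]          # lead surrogate with no trail
print(unpaired_surrogates(well_formed), unpaired_surrogates(lenient_only))
```
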
    > -2- grapheme meaningfulness
    > I take the opportunity to ask about grapheme (in the unicode sense *)
    > validity as well: the "grapheme cluster boundary" algorithm seems to quietly
    > allow building meaningless "graphemes" such as base-less (sequences of)
    > combining codes. What are we expected to do with them?
    >

    It depends on what you are trying to do. You can filter out degenerate cases
    or keep them. For more information, see http://unicode.org/reports/tr29/
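
    A minimal sketch of the "filter or keep" choice, assuming Python's
    standard unicodedata module and made-up helper names. It only looks at
    the start of a string; finding every degenerate cluster inside a text
    would need a real UAX #29 segmenter, which this is not.

```python
import unicodedata


def starts_with_mark(text: str) -> bool:
    # True when the first code point is a mark (General_Category Mn, Mc, Me),
    # i.e. the text begins with combining marks that have no base character.
    return bool(text) and unicodedata.category(text[0]).startswith("M")


def strip_leading_marks(text: str) -> str:
    # One possible policy: drop base-less marks at the start of the text.
    i = 0
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    return text[i:]


sample = "\u0301\u0300abc"   # two combining accents with no base, then "abc"
print(starts_with_mark(sample), strip_leading_marks(sample))
```

    Keeping such sequences is equally legitimate, for example in an editor
    that must round-trip whatever the user typed.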

    >
    > -3- _unique_ ordering
    > The "canonical" ordering algorithm does not provide a unique
    > representation: codes with the same ordering class (ccc) are not ordered.
    > For instance, most (all?) diacritics placed above have the same class (230).
    > Thus, <A><dot above><tilde> and <A><tilde><dot above> can both be output of
    > ordering, while they represent the same piece of text.
    >

    That is incorrect. These *do not* represent the same text.
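
    Marks with the same combining class interact typographically (they stack
    outward from the base), so their relative order is significant and the
    canonical ordering deliberately leaves it alone. Only marks with
    different non-zero classes are reordered. A small check, sketched with
    Python's unicodedata module (my choice for illustration):

```python
import unicodedata

DOT_ABOVE = "\u0307"   # ccc=230
TILDE = "\u0303"       # ccc=230
DOT_BELOW = "\u0323"   # ccc=220


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


# Same combining class: order is preserved, so the two sequences stay distinct.
same_class_equal = nfd("A" + DOT_ABOVE + TILDE) == nfd("A" + TILDE + DOT_ABOVE)

# Different classes: canonical ordering sorts them, so the two sequences
# normalize to the same result.
diff_class_equal = nfd("A" + DOT_ABOVE + DOT_BELOW) == nfd("A" + DOT_BELOW + DOT_ABOVE)

print(same_class_equal, diff_class_equal)
```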

    There are some cases, especially with combining characters with ccc=0, where
    the canonical ordering is not sufficient. Moreover, in general the
    normalization algorithms do not and cannot always give a unique output for
    "the same text", since that phrase is so vague. "A" and "a" are the same
    word in English, but are not merged by normalization; moreover, it may vary
    by language: "aa" and "å" in Danish.

    So you have to be much more precise about which sense of "the same" you
    are looking for.
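
    To illustrate how the senses differ, here is a hedged sketch with three
    hypothetical comparison functions in Python. The caseless one follows the
    NFD(toCasefold(NFD(X))) recipe for canonical caseless matching from
    Chapter 3, with Python's str.casefold() standing in for toCasefold. None
    of them handles the Danish "aa"/"å" case, which needs language-specific
    tailoring.

```python
import unicodedata


def same_code_points(a: str, b: str) -> bool:
    # Strictest sense: identical sequences of code points.
    return a == b


def canonically_equivalent(a: str, b: str) -> bool:
    # Equal after canonical normalization (NFC here; NFD would work as well).
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def canonical_caseless_match(a: str, b: str) -> bool:
    # NFD(casefold(NFD(x))) applied to both sides, then compared.
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


pairs = [("\u00e5", "a\u030a"), ("A", "a"), ("aa", "\u00e5")]
for x, y in pairs:
    print(same_code_points(x, y),
          canonically_equivalent(x, y),
          canonical_caseless_match(x, y))
```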

    > I thought the core point of normalisation was precisely to provide a
    > _unique_ form for each text --so that, for instance, one can safely and
    > efficiently search/count/replace... But if I search the first form in a text
    > that holds the second, I'll miss it.
    >

    It may help to look at the section on matching in the UCA (Unicode
    Collation Algorithm, UTS #10).

    >
    >
    > Denis
    >
    > (*) I mean here "grapheme" not in the common sense of graphical form of a
    > phoneme, but in the Unicode sense of character in the common sense ;-)
    > -- -- -- -- -- -- --
    > vit esse estrany ☣
    >
    > spir.wikidot.com
    >
    >
    >
    >


