Re: some light a few points

From: Mark Davis ☕ (mark@macchiato.com)
Date: Wed Dec 22 2010 - 15:58:19 CST

Next message: Asmus Freytag: "Re: some light a few points"

Previous message: Lisa Moore: "Re: Unicode for Tamil (Thamizh) - From a Pondicherrian in France"
In reply to: spir: "some light a few points"
Next in thread: Asmus Freytag: "Re: some light a few points"

Mark

*— Il meglio è l’inimico del bene —*

On Wed, Dec 22, 2010 at 09:21, spir <denis.spir@gmail.com> wrote:

> Hello,
>
>
> -1- code validity
> I have long thought only values corresponding to surrogates were invalid
> codes. But I recently discovered both in D's builtin unicode-aware chars &
> strings, and on the site 'fileformat' (
> http://www.fileformat.info/info/unicode/char/ffff/index.htm) that some
> other codes are invalid, like fffe & ffff.
> I'm a bit lost. What's true, then? And where can I find actual and *clear*
> definitions of code validity?
>

See Chapter 3 in http://www.unicode.org/versions/Unicode6.0.0/, especially
the definitions of "well-formed" vs. "ill-formed" code unit sequences.
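
As a rough illustration of the distinction, here is a minimal Python sketch. The helper names (is_surrogate, is_noncharacter, utf8_encodable) are made up for this example and are not part of any library. It separates surrogate code points from noncharacters such as U+FFFE and U+FFFF, and shows that Python's UTF-8 codec accepts U+FFFF but refuses a lone surrogate. How a given implementation such as D treats noncharacters is a policy choice that this sketch does not cover.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the last
    two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def utf8_encodable(text: str) -> bool:
    """True if Python's strict UTF-8 codec can encode the string."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for cp in (0xD800, 0xFFFE, 0xFFFF, 0x0041):
        print(f"U+{cp:04X}",
              "surrogate" if is_surrogate(cp) else "",
              "noncharacter" if is_noncharacter(cp) else "",
              "encodable" if utf8_encodable(chr(cp)) else "not encodable")
```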

> I also discovered in ICU docs that it does not reject unpaired surrogates (
> http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings).
> D instead rejects unpaired surrogates and more. ???
>

ICU treats unpaired surrogates as if they were unassigned characters, when
manipulating them as strings. That is a common technique, see discussions of
"UnicodeString" in Chapter 3.
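
To make "unpaired" concrete, here is a small Python sketch that scans a sequence of 16-bit code units and reports the positions of surrogates with no partner. The function name unpaired_surrogates is hypothetical. This is not how ICU is implemented; it only illustrates what a stricter consumer would have to check before rejecting such a string.

```python
from typing import List, Sequence


def unpaired_surrogates(units: Sequence[int]) -> List[int]:
    """Return the indexes of 16-bit code units that are surrogates
    without a matching partner (lead without trail, or stray trail)."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:            # lead (high) surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                        # well-formed pair, skip both
                continue
            bad.append(i)                     # lead with no trail
        elif 0xDC00 <= u <= 0xDFFF:           # trail (low) with no lead
            bad.append(i)
        i += 1
    return bad


if __name__ == "__main__":
    # 'A', a valid pair (U+1F600), a lone lead, 'B', a lone trail
    sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
    print(unpaired_surrogates(sample))
```

A library following the lenient approach would leave the units at the reported positions in place and treat them like unassigned code points, while a strict one would raise an error or substitute U+FFFD.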

> -2- grapheme meaningfulness
> I take the opportunity to ask about grapheme (in the unicode sense *)
> validity as well: the "grapheme cluster boundary" algorithm seems to quietly
> allow building meaningless "graphemes" such as base-less (sequences of)
> combining codes. What are we expected to do with them?
>

It depends on what you are trying to do. You can filter out degenerate cases
or keep them. For more information, see http://unicode.org/reports/tr29/
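
If you decide to filter, a check along these lines may be enough. This is a minimal Python sketch with hypothetical helper names, and it assumes that "degenerate" simply means a string or cluster that begins with a combining mark (general category Mn, Mc, or Me). It does not implement the TR29 segmentation rules themselves, and it ignores other cases such as marks that follow a control character or line break.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")


def is_mark(ch: str) -> bool:
    """True if the character has a combining-mark general category."""
    return unicodedata.category(ch) in MARK_CATEGORIES


def starts_with_mark(cluster: str) -> bool:
    """True if the cluster has no base: its first character is a mark."""
    return bool(cluster) and is_mark(cluster[0])


def strip_leading_marks(text: str) -> str:
    """Drop base-less combining marks at the very start of the text."""
    i = 0
    while i < len(text) and is_mark(text[i]):
        i += 1
    return text[i:]


if __name__ == "__main__":
    print(starts_with_mark("\u0301a"))   # acute accent with no base
    print(starts_with_mark("a\u0301"))   # ordinary base + accent
    print(ascii(strip_leading_marks("\u0301\u0300a\u0301")))
```

Keeping them is just as legitimate: an editor displaying the text, for example, would usually render a base-less mark on a dotted circle rather than drop it.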

>
> -3- _unique_ ordering
> The "canonical" ordering algorithm does not provide a unique
> representation: codes with the same ordering class (ccc) are not ordered.
> For instance, most (all?) diacritics placed above have the same class (230).
> Thus, <A><dot above><tilde> and <A><tilde><dot above> can both be output of
> ordering, while they represent the same piece of text.
>

That is incorrect. These *do not* represent the same text.
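
You can check this with any normalization implementation. The Python sketch below uses the standard unicodedata module. The dot above (U+0307) and the tilde (U+0303) both have ccc=230, so canonical ordering leaves their relative order alone and the two sequences stay distinct. By contrast, a dot below (U+0323, ccc=220) and a dot above do get reordered into a single form. The variable names are only for illustration.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # ccc = 230
TILDE = "\u0303"       # ccc = 230
DOT_BELOW = "\u0323"   # ccc = 220


def nfd(text: str) -> str:
    """Canonical decomposition followed by canonical ordering."""
    return unicodedata.normalize("NFD", text)


# Same combining class: order is significant and is preserved.
same_class_a = "A" + DOT_ABOVE + TILDE
same_class_b = "A" + TILDE + DOT_ABOVE

# Different combining classes: both orders normalize to one form.
diff_class_a = "A" + DOT_BELOW + DOT_ABOVE
diff_class_b = "A" + DOT_ABOVE + DOT_BELOW

if __name__ == "__main__":
    print(unicodedata.combining(DOT_ABOVE), unicodedata.combining(TILDE))
    print(nfd(same_class_a) == nfd(same_class_b))   # stay distinct
    print(nfd(diff_class_a) == nfd(diff_class_b))   # become identical
```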

There are some cases, especially with combining characters with ccc=0, where
the canonical ordering is not sufficient. Moreover, in general the
normalization algorithms do not and cannot always give a unique output for
"the same text", since that phrase is so vague. "A" and "a" are the same
word in English, but are not merged by normalization; moreover, it may vary
by language: "aa" and "å" in Danish.

So you have to be much more precise about which sense of "the same" you
are looking for.
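
To illustrate how these senses differ, here is a small Python sketch. The function normalization_merges is a made-up helper, not a library call. It shows that none of the four normalization forms merges "A" with "a" or "aa" with "å", while case folding (a separate operation) does merge the first pair. A language-specific equivalence such as the Danish one needs tailored collation, which the Python standard library does not provide and which this sketch does not attempt.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_merges(a: str, b: str) -> bool:
    """True if any Unicode normalization form maps a and b to the
    same string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


if __name__ == "__main__":
    # Precomposed vs. decomposed: canonically equivalent, so merged.
    print(normalization_merges("\u00e5", "a\u030a"))
    # Case difference: not merged by normalization...
    print(normalization_merges("A", "a"))
    # ...but merged by case folding, a different notion of "the same".
    print("A".casefold() == "a".casefold())
    # Danish "aa" vs "å": not merged by normalization either.
    print(normalization_merges("aa", "\u00e5"))
```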

> I thought the core point of normalisation was precisely to provide a
> _unique_ form for each text --so that, for instance, one can safely and
> efficiently search/count/replace... But if I search the first form in a text
> that holds the second, I'll miss it.
>

What may help is for you to look at the UCA, in the section on matching.

>
>
> Denis
>
> (*) I mean here "grapheme" not in the common sense of graphical form of a
> phoneme, but in the Unicode sense of character in the common sense ;-)
> -- -- -- -- -- -- --
> vit esse estrany ☣
>
> spir.wikidot.com
>
>
>
>


This archive was generated by hypermail 2.1.5 : Wed Dec 22 2010 - 16:02:56 CST