RE: Normalisation and Greek characters

From: Kenneth Whistler ([email protected])
Date: Mon Mar 17 2003 - 18:29:54 EST

Next message: Kenneth Whistler: "Re: U+00D0, U+01b7 -- variants or distinct chars?"

Previous message: [email protected]: "U+00D0, U+01b7 -- variants or distinct chars?"
Maybe in reply to: David J. Perry: "Normalisation and Greek characters"

David Perry wrote:

> Thanks to John for pointing me in the right direction; the normalization
> charts were not helpful, but after spending some time with UAX#15 and
> looking at the actual Unicode database, I see what is going on here.
>
> It seems strange to me that the Unicode book (where I initially looked)
> simply gives the decomposition for U+1F71 as U+03B1 followed by U+0301
> with no indication that anything else is involved (and likewise for
> other characters with singleton decompositions, such as the angstrom).
> If U+1F71 decomposes to U+03AC, why is the other decomposition still
> given?

I deduce from this that you must be looking at the Unicode *2.0*
book, which did, indeed, give the canonical decomposition of
U+1F71 as U+03B1 + U+0301.

But the Unicode *3.0* book gives the canonical *mapping* of
U+1F71 as U+03AC.

The distinction is one of how the data was presented.
Unicode 2.0 gave *full* decompositions for all characters.
That was true for UnicodeData-2.0.14.txt, which was used to
drive the formatting of the names list for the Unicode 2.0 book.

But starting with the Unicode 2.1.9 update, the decompositions
were given as decomposition mappings. Those constituted the
*proximate* decompositions -- decompositions into the next
closest stage. And the *full* decompositions had to be derived
by recursive application of all applicable decomposition
mappings. The reason for doing so is precisely the kind of
problem you've run into -- listing only the full
decompositions loses information about the intermediate steps
in the decomposition. Once Unicode 3.0 was published, and
once Unicode normalization was defined in UAX #15, this
distinction became vitally important to normalization, because
of the recomposition rules. And if you look at the Unicode 3.0
names list (which you can find online, if necessary), then
only the singleton decomposition mapping for U+1F71 is shown,
as noted above.
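
The proximate/full distinction described above can be sketched in a few
lines. This is only an illustration, assuming Python's standard
unicodedata module (which carries a current version of the Unicode
Character Database, not the 2.x or 3.0 data files discussed here). The
helper names proximate, full and hexes are made up for the example, and
the sketch ignores canonical reordering and the algorithmic Hangul
decompositions, which a real normalizer must handle.

```python
import unicodedata

def proximate(ch):
    """Return the canonical decomposition mapping of ch, or ch if it has none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; they are not
    # canonical, so this sketch leaves those characters alone.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(chr(int(cp, 16)) for cp in mapping.split())

def full(ch):
    """Derive the full decomposition by applying the mappings recursively."""
    step = proximate(ch)
    if step == ch:
        return ch
    return "".join(full(c) for c in step)

def hexes(s):
    """Format a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

print(hexes(proximate("\u1F71")))                      # U+03AC
print(hexes(full("\u1F71")))                           # U+03B1 U+0301
print(hexes(unicodedata.normalize("NFD", "\u1F71")))   # U+03B1 U+0301
print(hexes(unicodedata.normalize("NFC", "\u1F71")))   # U+03AC
```

The last line shows why the intermediate step matters for normalization:
because U+1F71 has a singleton mapping, recomposition yields U+03AC
rather than restoring U+1F71.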

--Ken

This archive was generated by hypermail 2.1.5 : Mon Mar 17 2003 - 19:09:18 EST