Re: ISO character sets stable under NFC?

From: Philippe Verdy ([email protected])
Date: Mon Mar 28 2005 - 16:00:17 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Apostrophe"

Previous message: Philippe Verdy: "Re: Apostrophe (was: Re: Security Issues: Navajo)"
In reply to: Mark Davis: "Re: ISO character sets stable under NFC?"

I'm not sure that holds for the Latin-Greek ISO-8859 charset (at least for
the compatibility forms)...

Look at the spacing tonos accent (U+0384, 0xB4 in ISO-8859-7), or the spacing
dialytika-tonos accent (U+0385, 0xB5 in ISO-8859-7).

In the UCD we have these three lines:
00A8;DIAERESIS;Sk;0;ON;<compat> 0020 0308;;;;N;SPACING DIAERESIS;;;;
0384;GREEK TONOS;Sk;0;ON;<compat> 0020 0301;;;;N;GREEK SPACING TONOS;;;;
0385;GREEK DIALYTIKA TONOS;Sk;0;ON;00A8 0301;;;;N;GREEK SPACING DIAERESIS TONOS;;;;

And the following composition exclusion in section (4) Non-Starter
Decompositions:
# 0344 COMBINING GREEK DIALYTIKA TONOS

This results in the following mappings:

#Unicode (ISO); NFC (ISO); NFD (ISO); NFKC (ISO); NFKD (ISO); CHARACTER NAME

00A8 (0xA8); 00A8 (0xA8); 00A8 (0xA8); 0020 0308 (0x20 ????); 0020 0308 (0x20 ????); # DIAERESIS (or GREEK DIALYTIKA)

0384 (0xB4); 0384 (0xB4); 0384 (0xB4); 0020 0301 (0x20 ????); 0020 0301 (0x20 ????); # GREEK TONOS

0385 (0xB5); 0385 (0xB5); 00A8 0301 (0xA8 ????); 0020 0308 0301 (0x20 ???? ????); 0020 0308 0301 (0x20 ???? ????); # GREEK DIALYTIKA TONOS
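
For anyone who wants to re-check the mappings above against their own copy of
the UCD, here is a minimal sketch using Python's standard unicodedata module.
The helper names (hex_codes, normalization_row) are made up for this
illustration, and the results depend on the Unicode version bundled with the
interpreter, although these three decompositions are not expected to change.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def hex_codes(s):
    """Render a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

def normalization_row(ch):
    """Return the four normalization forms of ch, each as hex code points."""
    return {form: hex_codes(unicodedata.normalize(form, ch)) for form in FORMS}

# The three spacing diacritics discussed above.
for ch in ("\u00A8", "\u0384", "\u0385"):
    row = normalization_row(ch)
    cells = "; ".join(row[form] for form in FORMS)
    print(f"{hex_codes(ch)}; {cells}; # {unicodedata.name(ch)}")
```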

With NFD there appears to be a problem, because the decomposed string cannot
be encoded directly: ISO8859-7 does not contain any combining character.
A conversion from Unicode to ISO8859-7 should therefore normalize to NFC
first, which recomposes 00A8 0301 into 0385, so 0xB5 is stable under NFD too.
This is not true with NFKD, because NFC cannot recompose 0020 0308 or
0020 0301 into the original spacing characters.
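
The difference between the NFD and NFKD cases can be sketched as follows,
again with Python's unicodedata module and its built-in iso8859_7 codec. The
function name encode_greek is hypothetical, chosen only for this example.

```python
import unicodedata

def encode_greek(text):
    """Hypothetical helper: compose canonically (NFC), then encode to ISO 8859-7."""
    return unicodedata.normalize("NFC", text).encode("iso8859_7")

# NFD of U+0385 is U+00A8 U+0301; NFC puts it back together.
decomposed = unicodedata.normalize("NFD", "\u0385")
print(encode_greek(decomposed))  # b'\xb5'

# NFKD of U+0385 is U+0020 U+0308 U+0301; NFC leaves that sequence alone,
# so the combining marks reach the encoder and it fails.
compat = unicodedata.normalize("NFKD", "\u0385")
try:
    encode_greek(compat)
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```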

This means that converting a Greek Unicode text to ISO8859-7 requires
allowing the composition using the compatibility mapping of DIAERESIS (alias
GREEK DIALYTIKA), as if it were canonical.
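
One way such a converter might treat the compatibility mappings as if they
were canonical is a small hand-written recomposition table applied after NFC.
This is only a sketch under that assumption: COMPAT_RECOMPOSE and
encode_greek_lenient are names invented here, and the approach also folds a
genuine space followed by a combining mark into the spacing accent, which may
not be wanted in every text.

```python
import unicodedata

# Compatibility decompositions to undo, longest sequence first so that
# U+0020 U+0308 U+0301 is not split into DIAERESIS plus a stray acute.
COMPAT_RECOMPOSE = [
    ("\u0020\u0308\u0301", "\u0385"),  # GREEK DIALYTIKA TONOS
    ("\u0020\u0308", "\u00A8"),        # DIAERESIS
    ("\u0020\u0301", "\u0384"),        # GREEK TONOS
]

def encode_greek_lenient(text):
    """Hypothetical helper: NFC, then undo the three compatibility
    decompositions by hand, then encode to ISO 8859-7."""
    text = unicodedata.normalize("NFC", text)
    for sequence, spacing in COMPAT_RECOMPOSE:
        text = text.replace(sequence, spacing)
    return text.encode("iso8859_7")

compat = unicodedata.normalize("NFKD", "\u0385")
print(encode_greek_lenient(compat))  # b'\xb5'
```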

There are probably similar cases in other ISO8859 charsets (I did not check
completely, but this may occur in Latin-Thai ISO8859-11, though not in
Latin-Hebrew or Latin-Arabic, which do not include any spacing diacritic
with compatibility-only decomposition mappings in the UCD).
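
Checking the remaining charsets can be done mechanically. The sketch below
assumes Python's built-in iso8859_N codecs are an acceptable stand-in for the
standards themselves, and it only tests each code position in isolation, so
it says nothing about reordering of combining marks within longer strings.
The name unstable_bytes is made up for this example.

```python
import unicodedata

def unstable_bytes(codec, form="NFC"):
    """Return the byte values whose decoded character changes under `form`."""
    result = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # code position not assigned in this charset
        if unicodedata.normalize(form, ch) != ch:
            result.append(b)
    return result

# Python ships codecs for parts 1 to 16, except the abandoned part 12.
for part in (n for n in range(1, 17) if n != 12):
    codec = f"iso8859_{part}"
    nfc = [hex(b) for b in unstable_bytes(codec, "NFC")]
    nfkc = [hex(b) for b in unstable_bytes(codec, "NFKC")]
    print(f"{codec}: NFC-unstable {nfc}, NFKC-unstable {nfkc}")
```

Passing "NFKC" instead of "NFC" lists the positions affected by compatibility
mappings, such as 0xB4 and 0xB5 in ISO8859-7.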

----- Original Message -----
From: "Mark Davis" <[email protected]>
To: "Kenneth Whistler" <[email protected]>; <[email protected]>
Cc: <[email protected]>
Sent: Monday, March 28, 2005 10:15 PM
Subject: Re: ISO character sets stable under NFC?

> Ken is most likely right; but it'd be better to test them to be sure.
>
> Mark
>
> ----- Original Message -----
> From: "Kenneth Whistler" <[email protected]>
> To: <[email protected]>
> Cc: <[email protected]>
> Sent: Monday, March 28, 2005 11:08
> Subject: Re: ISO character sets stable under NFC?
>
>
>> Elliotte Rusty Harold asked:
>>
>> > Question:
>> >
>> > Aside from 8859-1, which, if any, of the ISO 8859 character sets are
>> > stable under normalization form C?
>>
>> It should be *all* of them.
>>
>> --Ken
>>
>> >
>> > For example, suppose I have a string of Unicode characters, each of
>> > which is defined in ISO-8859-2. I then normalize this string according
>> > to NFC. Is it guaranteed that the resulting string will be
>> > character-per-character identical to the original string?
>> >
>> > I believe this is true for 8859-1. Is it true for any of the other 8859
>> > character sets?
>>
>>
>>
>
>
>


This archive was generated by hypermail 2.1.5 : Mon Mar 28 2005 - 16:01:37 CST