Re: ISO character sets stable under NFC?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 28 2005 - 16:00:17 CST


    I'm not sure that holds for the Latin-Greek ISO-8859 charset (at least for
    the compatibility forms)...

    Look at the spacing tonos accent (U+0384, 0xB4 in ISO-8859-7), or the spacing
    dialytika-tonos accent (U+0385, 0xB5 in ISO-8859-7).

    In the UCD we have these three lines:
    00A8;DIAERESIS;Sk;0;ON;<compat> 0020 0308;;;;N;SPACING DIAERESIS;;;;
    0384;GREEK TONOS;Sk;0;ON;<compat> 0020 0301;;;;N;GREEK SPACING TONOS;;;;
    0385;GREEK DIALYTIKA TONOS;Sk;0;ON;00A8 0301;;;;N;GREEK SPACING DIAERESIS TONOS;;;;

    And the following composition exclusion, listed in section (4) Non-Starter
    Decompositions:
    # 0344 COMBINING GREEK DIALYTIKA TONOS

    This results in the following mappings:

    #Unicode (ISO); NFC (ISO); NFD (ISO); NFKC (ISO); NFKD (ISO); CHARACTER NAME

    00A8 (0xA8); 00A8 (0xA8); 00A8 (0xA8); 0020 0308 (0x20 ????); 0020 0308 (0x20 ????); # DIAERESIS (or GREEK DIALYTIKA)

    0384 (0xB4); 0384 (0xB4); 0384 (0xB4); 0020 0301 (0x20 ????); 0020 0301 (0x20 ????); # GREEK TONOS

    0385 (0xB5); 0385 (0xB5); 00A8 0301 (0xA8 ????); 0020 0308 0301 (0x20 ???? ????); 0020 0308 0301 (0x20 ???? ????); # GREEK DIALYTIKA TONOS
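
    The Unicode columns of this table can be double-checked with Python's
    standard unicodedata module. This is only a sketch; the helper name
    forms() is made up for illustration and is not part of any library.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def forms(ch):
    """Return the code points of `ch` under each normalization form (hypothetical helper)."""
    return {
        form: " ".join(f"{ord(c):04X}" for c in unicodedata.normalize(form, ch))
        for form in FORMS
    }

# The three spacing accents discussed above.
for ch in ("\u00A8", "\u0384", "\u0385"):
    print(f"{ord(ch):04X}", forms(ch))
```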

    With NFD, we have an apparent problem because 0xB5 does not round-trip
    directly: it becomes 00A8 0301, and 0301 has no ISO8859-7 code. However,
    ISO8859-7 does not contain any combining character, so a conversion from
    Unicode to ISO8859-7 should first normalize to NFC; 00A8 0301 then
    recomposes to 0385, so 0xB5 becomes stable under NFD too. This is not true
    with NFKD, because 0020 0308 and 0020 0301 can't be recomposed with NFC.
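
    A minimal sketch of that round trip, assuming Python's built-in
    "iso8859_7" codec matches the mapping used here (0xB5 <-> U+0385). The
    function name to_iso8859_7() is hypothetical.

```python
import unicodedata

def to_iso8859_7(text):
    """Normalize to NFC, then encode; raises UnicodeEncodeError if a character is unmappable."""
    return unicodedata.normalize("NFC", text).encode("iso8859_7")

original = b"\xb5"
decoded = original.decode("iso8859_7")            # U+0385

# NFD splits the character, but NFC puts it back before encoding.
nfd = unicodedata.normalize("NFD", decoded)       # U+00A8 U+0301
nfd_round_trips = to_iso8859_7(nfd) == original

# NFKD goes through the <compat> mapping of U+00A8, which NFC does not undo.
nfkd = unicodedata.normalize("NFKD", decoded)     # U+0020 U+0308 U+0301
try:
    to_iso8859_7(nfkd)
    nfkd_round_trips = True
except UnicodeEncodeError:
    nfkd_round_trips = False

print(nfd_round_trips, nfkd_round_trips)
```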

    This means that converting a Greek Unicode text to ISO8859-7 requires, for
    compatibility-decomposed input, allowing the composition using the
    compatibility mapping of DIAERESIS (alias GREEK DIALYTIKA), as if it were
    canonical.
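
    One way such a converter could be sketched is shown here. The table
    SPACING_FORMS and the function recompose_spacing() are assumptions made
    for illustration, not an existing API. Note that the step is lossy: a
    genuine space followed by a combining accent cannot be told apart from a
    decomposed spacing accent.

```python
import unicodedata

# Hypothetical table: treat these <compat> decompositions as if they were canonical.
# The longest sequence comes first so that 0020 0308 0301 is not split up.
SPACING_FORMS = [
    ("\u0020\u0308\u0301", "\u0385"),  # GREEK DIALYTIKA TONOS
    ("\u0020\u0308", "\u00A8"),        # DIAERESIS
    ("\u0020\u0301", "\u0384"),        # GREEK TONOS
]

def recompose_spacing(text):
    """Apply NFC, then fold SPACE + combining accent back into the spacing accent."""
    text = unicodedata.normalize("NFC", text)
    for sequence, spacing in SPACING_FORMS:
        text = text.replace(sequence, spacing)
    return text

for byte in (b"\xa8", b"\xb4", b"\xb5"):
    nfkd = unicodedata.normalize("NFKD", byte.decode("iso8859_7"))
    print(byte, recompose_spacing(nfkd).encode("iso8859_7"))
```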

    There are probably similar cases in other ISO8859 charsets (I did not check
    completely, but this may occur in Latin-Thai ISO8859-11, though not in
    Latin-Hebrew and Latin-Arabic, which do not include any spacing
    diacritic with compatibility-only decomposition mappings in the UCD).
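
    A complete check is easy to script. The sketch below assumes the
    "iso8859_N" codecs shipped with CPython are acceptable stand-ins for the
    ISO 8859 parts (there is no part 12); unstable() is a hypothetical helper
    name. It counts, per charset, the characters that change under each
    normalization form.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")
PARTS = [n for n in range(1, 17) if n != 12]  # ISO 8859 has no part 12

def unstable(codec, form):
    """Return the characters of a single-byte codec that change under `form`."""
    changed = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this charset
        if unicodedata.normalize(form, ch) != ch:
            changed.append(ch)
    return changed

for part in PARTS:
    codec = f"iso8859_{part}"
    report = {form: len(unstable(codec, form)) for form in FORMS}
    print(codec, report)
```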

    ----- Original Message -----
    From: "Mark Davis" <mark.davis@jtcsv.com>
    To: "Kenneth Whistler" <kenw@sybase.com>; <elharo@metalab.unc.edu>
    Cc: <unicode@unicode.org>
    Sent: Monday, March 28, 2005 10:15 PM
    Subject: Re: ISO character sets stable under NFC?

    > Ken is most likely right; but it'd be better to test them to be sure.
    >
    > Mark
    >
    > ----- Original Message -----
    > From: "Kenneth Whistler" <kenw@sybase.com>
    > To: <elharo@metalab.unc.edu>
    > Cc: <unicode@unicode.org>
    > Sent: Monday, March 28, 2005 11:08
    > Subject: Re: ISO character sets stable under NFC?
    >
    >
    >> Elliotte Rusty Harold asked:
    >>
    >> > Question:
    >> >
    >> > Aside from 8859-1, which, if any, of the ISO 8859 character sets are
    >> > stable under normalization form C?
    >>
    >> It should be *all* of them.
    >>
    >> --Ken
    >>
    >> >
    >> > For example, suppose I have a string of Unicode characters, each of
    >> > which is defined in ISO-8859-2. I then normalize this string according
    >> > to NFC. Is it guaranteed that the resulting string will be
    >> > character-per-character identical to the original string?
    >> >
    >> > I believe this is true for 8859-1. Is it true for any of the other 8859
    >> > character sets?
    >>


