Re: PRI#86 Update

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 10 2006 - 18:30:03 CDT


    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > Actually a font that produces an incorrect glyph makes the rendering system
    > non-compliant, even if it is the glyph shown in the codecharts. Of course,
    > that does raise the question of how one knows what character U+0EA3 LAO
    > LETTER LO LING is. (Hint: Don't ask in Vientiane!) I suppose the comment
    > 'Based on TIS 620-2529' is important in this context. There's a better
    > solution in Unicode 5.0. On the other hand, digits can generally be
    > identified from their properties.

    I am speaking here about the errors that were corrected in the corrigenda integrated into Unicode 4.0.1. If an application claims to conform to Unicode 4.0, it is still conforming today, even if it displays the wrong glyph, as long as it matches the glyphs that were published in Unicode 4.0 and have since been corrected.

    This is certainly a gray area, because a system that uses the corrected glyph is also conforming to Unicode 4.0 even though it applies the corrigenda. Only an application that claims to conform to Unicode 4.0.1 but does not take the corrigenda into account is non-conforming and providing bad hints.

    An application that claims to conform to Unicode 3.0 was not supposed to integrate the corrigenda, so it would not include any support for the characters that were added in Unicode 4.0 and corrected later. Whatever such an application produces, it is still conforming to Unicode 3.0.

    That's why Unicode conformance needs to be tracked per version number: this reduces the number of compatibility tests to perform when building heterogeneous systems, because the incompatibilities are well known and listed for each exact Unicode version number. If a text does not use any codepoint affected by such an incompatibility, it will keep working in an upward-compatible manner, and no correction is necessary in the encoded texts or in the applications using them.

    Note: we are speaking about PRI#86, which only concerns the normalization processes; the discussion about glyphs is irrelevant here.

    My initial comment was about explicitly defining the conditions under which the stability of the normalizations (i.e. the uniqueness of their result across all Unicode versions) can be guaranteed. It is not guaranteed for unallocated codepoints; for example, the new Balinese vowels with tedung will get canonical equivalences in Unicode 5.0 (mappings that do not yet exist in any published version of Unicode), so no existing conforming normalization process is supposed to produce them or to consider them canonically equivalent, for the purpose of case mappings, collation (tailored or not), and other text transforms.
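
    As a minimal sketch of that stability condition, the following Python fragment (using only the standard unicodedata module; the function name is mine, purely illustrative) tests whether a string avoids unallocated codepoints, which is the precondition for relying on the stability of its normalization across versions:

        import unicodedata

        def normalization_is_stable(text):
            # Unassigned codepoints (general category 'Cn') may receive new
            # decomposition mappings in a later Unicode version, so NFC/NFD
            # results for them cannot be assumed stable across versions.
            return all(unicodedata.category(ch) != 'Cn' for ch in text)

        print(unicodedata.unidata_version)        # the version this check is based on
        print(normalization_is_stable('\u0041'))  # True for allocated codepoints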

    My comment about collation is still true: a conforming implementation of collation MUST return the same collation keys (except for the last, implicit level based on codepoint values only) for all canonically equivalent strings. This requirement cannot be guaranteed by any collator working on unallocated codepoints. This means that the result of the collator must explicitly track the version number on which it is based (for example in RDBMS engines for table sorts).
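
    To illustrate the requirement (this is only a sketch, assuming the PyICU bindings to ICU are installed; it is not tied to any particular RDBMS), canonically equivalent spellings of the same text must yield identical sort keys from a conforming collator, tailored or not:

        import unicodedata
        import icu

        collator = icu.Collator.createInstance(icu.Locale('fr_FR'))

        s_nfc = unicodedata.normalize('NFC', 'e\u0301le\u0300ve')  # "élève", precomposed
        s_nfd = unicodedata.normalize('NFD', s_nfc)                # same text, decomposed

        # For allocated codepoints the keys must be identical; for codepoints
        # unassigned in the collator's Unicode version no such guarantee exists,
        # which is why the version must be recorded alongside the keys.
        assert collator.getSortKey(s_nfc) == collator.getSortKey(s_nfd)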

    To make this sort order stable, the created database should store somewhere, in a database or table meta-attribute, which Unicode version it adheres to; otherwise the sort order will become inconsistent (and this may break processes that assume that sorted lists generated by a database will be comparable over time, for example with diffs or with binary lookups). I just wonder, for example, whether Oracle allows the client of the RDBMS to specify the Unicode version it assumes, and whether the server can honor the Unicode version requested by the client and behave consistently, even after the RDBMS server has been upgraded to support a new Unicode version that covers more characters (and so implements collation based on additional canonical equivalences).
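
    A hypothetical sketch of such a meta-attribute, using Python's built-in sqlite3 (the table and key names are invented for illustration; a real RDBMS would expose this differently):

        import sqlite3
        import unicodedata

        def open_database(path, expected_unicode_version):
            conn = sqlite3.connect(path)
            conn.execute("CREATE TABLE IF NOT EXISTS db_meta (key TEXT PRIMARY KEY, value TEXT)")
            row = conn.execute("SELECT value FROM db_meta WHERE key = 'unicode_version'").fetchone()
            if row is None:
                # New database: record the version the current software implements.
                conn.execute("INSERT INTO db_meta VALUES ('unicode_version', ?)",
                             (expected_unicode_version,))
                conn.commit()
            elif row[0] != expected_unicode_version:
                # Sort orders and indexes built under another version cannot be
                # assumed comparable over time; refuse rather than silently mix.
                raise RuntimeError("database assumes Unicode %s, client expects %s"
                                   % (row[0], expected_unicode_version))
            return conn

        conn = open_database('texts.db', unicodedata.unidata_version)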

    One way to avoid such problems would be for the RDBMS server to reject (with an SQL error) all strings that contain codepoints not allocated in the Unicode version specified by the client. In fact this is more complex, because SQL searches should behave as if the database on the server did not contain any row with such characters, and this would impact things like COUNT(*) and other aggregates, tests of existence or non-existence, and so on. For this reason, I think the server should impose the Unicode version it uses for each database. To upgrade a database to support a newer Unicode version, one should first perform a maintenance check on the database, or the server should simply forbid storing any unallocated codepoint in the database as long as it has not been marked to support and use the new Unicode version (such marking of databases should not be allowed as long as the server software does not include support for the newer Unicode version).
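
    As a sketch of that rejection policy (the names are illustrative; a real server would perform this in its storage layer), the check itself is simple to express:

        import unicodedata

        class UnassignedCodepointError(ValueError):
            pass

        def validate_for_storage(text):
            # Refuse any codepoint that is unassigned ('Cn') in the Unicode
            # version this process implements (unicodedata.unidata_version).
            for ch in text:
                if unicodedata.category(ch) == 'Cn':
                    raise UnassignedCodepointError(
                        "U+%04X is not allocated in Unicode %s"
                        % (ord(ch), unicodedata.unidata_version))
            return text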

    If the server accepts SQL queries containing any valid codepoint without first checking whether they are allocated, then the collation will become inconsistent with regard to canonical equivalences when the server software is upgraded with a new collation rules table (based on the new DUCET version) that supports newly allocated characters. As this check may be a burden on the server's SQL performance, it is often not performed, but this has a price when the server software gets upgraded: the database must be checked in a routine maintenance pass to upgrade it to the newer Unicode version (which may be a very lengthy process on giant databases with terabytes of text data), and the compatibility of clients must be checked too.
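
    A rough sketch of such a maintenance pass, assuming (purely for illustration) that the database keeps both the raw text and the NFC form computed at insert time, in columns named raw_text and nfc_text: after upgrading the Unicode tables, recompute NFC under the new version and flag the rows whose stored normalization no longer matches, since their comparisons and sort keys may have changed.

        import sqlite3
        import unicodedata

        def find_stale_rows(conn):
            stale = []
            for rowid, raw, stored_nfc in conn.execute(
                    "SELECT rowid, raw_text, nfc_text FROM documents"):
                if unicodedata.normalize('NFC', raw) != stored_nfc:
                    stale.append(rowid)  # needs renormalization and reindexing
            return stale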

    Philippe.


