CEN TC304 - comments to prENV 13710

L2/00-029

From: Thorgeir Sigurdsson [thorgeir@stri.is]
Sent: Wednesday, January 26, 2000 2:20 PM

Subject: (CEN/TC304 N933) CEN approval of ENV 13710 European Ordering Rules

Lisa,

I'm still trying to find out whether the comments below are still going to be addressed at a TC304 disposition of comments meeting or whether the approval was on an 'as-is' basis. In the meantime, this could be made a UTC document.

A./

Dear TC304 member,

the ballot issued by CEN Central Secretariat on the European prestandard prENV13710 European Ordering Rules ended 2 Dec 1999.

The prestandard was approved. You will find a short report on this approval in N933

http://www.stri.is/TC304/DOCS/N933.html

You will also find the text of N933 included below:

My best regards

Þorgeir secretary of CEN/TC304

Subject: ENV 13710 European Ordering Rules- Approved

Source: TC304 secretariat

Date: 19. January 2000

Action: For information.

With a Dispatch Notice dated 4 January 2000, the CEN Central Secretariat announced the result of the formal vote on the draft European prestandard. It concluded that the prestandard has been approved and requested the Technical Committee to prepare the definitive texts and provide them to CEN/CS. CEN/CS will then proceed with the distribution of the definitive text.

The information in this document is taken from this Dispatch Notice.

CEN member country   Vote               #votes   Remarks

Austria              Accepted           4
Belgium              Accepted           5
Czech Republic       Accepted           3
Denmark              Accepted           3
Finland              Accepted           3
France               Accepted           10
Germany              Accepted           10
Greece               No vote received   5
Iceland              Accepted           1
Ireland              Accepted           3
Italy                Accepted           10
Luxembourg           Accepted           2
Netherlands          No vote received   5
Norway               No vote received   3
Portugal             Abstention         5        No answer from those concerned
Spain                Accepted           8
Sweden               Rejected           4        Comments
Switzerland          Abstention         5        We abstain from voting; no interested parties in Switzerland.
United Kingdom       Rejected           10       Comments sent via email

Total of voting: 12 (62) accepted, 2 (14) rejected, 2 (10) abstain
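The totals line can be cross-checked against the individual rows. The following is a minimal sketch in Python; the row data is transcribed from the ballot table above, while the names BALLOT, tally and TOTALS are illustrative only and do not come from the Dispatch Notice.

```python
from collections import defaultdict

# (country, vote, weighted votes), transcribed from the ballot table above.
BALLOT = [
    ("Austria", "Accepted", 4),
    ("Belgium", "Accepted", 5),
    ("Czech Republic", "Accepted", 3),
    ("Denmark", "Accepted", 3),
    ("Finland", "Accepted", 3),
    ("France", "Accepted", 10),
    ("Germany", "Accepted", 10),
    ("Greece", "No vote received", 5),
    ("Iceland", "Accepted", 1),
    ("Ireland", "Accepted", 3),
    ("Italy", "Accepted", 10),
    ("Luxembourg", "Accepted", 2),
    ("Netherlands", "No vote received", 5),
    ("Norway", "No vote received", 3),
    ("Portugal", "Abstention", 5),
    ("Spain", "Accepted", 8),
    ("Sweden", "Rejected", 4),
    ("Switzerland", "Abstention", 5),
    ("United Kingdom", "Rejected", 10),
]


def tally(ballot):
    """Return {vote: (number of members, sum of weighted votes)}."""
    totals = defaultdict(lambda: [0, 0])
    for _country, vote, weight in ballot:
        totals[vote][0] += 1
        totals[vote][1] += weight
    return {vote: tuple(counts) for vote, counts in totals.items()}


TOTALS = tally(BALLOT)
for vote, (members, weighted) in TOTALS.items():
    print(f"{vote}: {members} ({weighted})")
```

The sketch only sums the rows; it does not model the CEN rules that decide whether a weighted vote passes.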

The comments from Sweden and the UK follow:

1999-11-19

Swedish comments on prENV 13710 European Ordering Rules - Ordering of characters from the Latin, Greek and Cyrillic scripts

Sweden votes No on prENV 13710 as it currently stands. If our comments (except the editorial ones) are dealt with in an acceptable fashion our No will be changed to Yes.

SE Comment 1: all: The terminology used should be fully in line with the terminology used in ISO/IEC DIS 14651. If a slightly different terminology is used in some other related standard, then an exposition of the terminology differences should be in an annex, but not affect the terminology used in the main text. If copies of the 14651 definitions are needed, put them in an annex. Note in particular clause 4: no new or modified, nor copied definitions should occur here. Note also in particular annex A.2.1: the use of the word "key" is here very different from that used in 14651. This is confusing. Harmonisation of terminology between 14651 and prENV 13710 is important.

SE Comment 2: foreword (editorial): The last two paragraphs of the "Foreword" do not belong there, but should be moved to a separate "Introduction" (also an unnumbered heading), on a separate page. A "Foreword" usually contains general text not related to the technical (or similar) content. An "Introduction" on the other hand usually does refer to the content of the document.

SE Comment 3: all, but clause 7 and annex F in particular: The (pre)normative ordering must absolutely be a minimal tailoring of the 14651 CTT. Now that the 14651 CTT will have stable level 1 weight names, such a minimal tailoring should be unproblematic.

SE Comment 4: clause 1, second paragraph: this paragraph states that the tailoring in annex F and that (presently) in clause 7 are formally equivalent. THIS IS FALSE, they are in no way equivalent, and it must be the minimal tailoring that is normative. The very different tailoring now in clause 7 is flawed and should be deleted. If given MES-3 data, or full 10646 data, the results will be very different for the two current tailorings. And the tailoring now in clause 7 does not give acceptable results, and must therefore not be (pre)normative.

SE Comment 5: clauses 2 and 3: ISO/IEC 10646-1:1993 is not fully compatible with that standard as subsequently revised, and the 1993 version should now never be referenced (pre)normatively. In addition, some characters not in the 1993 edition of 10646 are being used in prENV 13710.

SE Comment 6: clause 7: See comment 3 above. Preferably the current content of clause 7 is deleted entirely.

SE Comment 7: clause 7: A "machine readable" (without OCR, and without extraneous formatting) file for the [to be minimal; essentially from annex F] tailoring should be available easily, preferably via the WWW. (Editorial: preferably without any pages as large bitmapped scanned pages.)

SE Comment 8: clause 7: The "reorder-after" statements at the end have null effect. They and the "section" lines should be deleted, even if this "expansion" is kept as an informative annex. (Which it shouldn't since it introduces a confusing different tailoring.)

SE Comment 9: clause 7: Most (all?) of the characters in the final note are already in 10646 with the latest amendments.

SE Comment 10: clause 7: 'Ligature ij' is sorted exactly (all levels) as a 'dotless i' followed by a 'j', which it should not be; there should be a distinction at some level (level 2 or 3).

SE Comment 11: clause 7: Sharp s is sorted exactly (on all levels) as a long s followed by a round s. Though this is nearly right, there should be a distinction at some level. 14651 handles this case properly.

prENV 13710 (a): <U0111> <S705>;<STROKE>;<MIN>;<U0111>

prENV 13710 (b): <U00F0> <S705>;<MODIFIED>;<MIN>;<U00F0>

Should be (a): <U0111> <S0064>;<BASE><VRNT1>;<MIN><MIN>;<U0111>

Should be (b): <U00F0> <S0064>;<BASE><VRNT2>;<MIN><MIN>;<U00F0>
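The problem raised in SE Comments 10 and 11 can be sketched as follows. This is a toy model, not the prENV 13710 or 14651 tables: the numeric weights, the three-level layout and the names sort_key, CLAUSE7_LIKE and DISTINGUISHED are all assumptions made for illustration. It shows that a ligature expanded into elements with exactly the same weights as its parts cannot be told apart from the two-character sequence, and that a different level 3 weight is enough to separate them.

```python
# Each character maps to a list of collating elements.
# Each element is a tuple of hypothetical (level 1, level 2, level 3) weights.


def sort_key(text, table):
    """Build a key that compares all level 1 weights, then level 2, then level 3."""
    elements = [element for ch in text for element in table[ch]]
    return tuple(tuple(element[level] for element in elements) for level in range(3))


# Toy table in which the ligature expands to exactly the weights of its parts.
CLAUSE7_LIKE = {
    "\u0131": [(10, 1, 1)],              # LATIN SMALL LETTER DOTLESS I
    "j": [(11, 1, 1)],
    "\u0133": [(10, 1, 1), (11, 1, 1)],  # LATIN SMALL LIGATURE IJ
}

# Same table, but the ligature carries a different (hypothetical) level 3 weight.
DISTINGUISHED = dict(CLAUSE7_LIKE)
DISTINGUISHED["\u0133"] = [(10, 1, 2), (11, 1, 2)]

LIGATURE = "\u0133"
SEQUENCE = "\u0131j"

# Identical keys: the two spellings are indistinguishable at every level.
print(sort_key(LIGATURE, CLAUSE7_LIKE) == sort_key(SEQUENCE, CLAUSE7_LIKE))

# Keys now agree at levels 1 and 2 and differ only at level 3.
print(sort_key(LIGATURE, DISTINGUISHED), sort_key(SEQUENCE, DISTINGUISHED))
```

The same reasoning applies to sharp s versus long s followed by round s.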

SE Comment 12: annex A.1.2: The naming of characters in 10646 is sometimes mistaken. The 'letterness' property is best found in the Unicode character database table of character properties (general category L with all its subcategories: Lu, Ll, Lt, Lo).

SE Comment 13: annex A.1.3: "first level letter" is a strange concept. Best removed.

SE Comment 14: annex A.1.3: When listing letters in collation order (before any further tailoring), and only considering the first level, only the lowercase or only the uppercase version should be listed. Case distinctions are only made at level 3. Diacritics differences are more important (level 2) and not listed.

SE Comment 15: annex A.1.3: An annex should not need to have its own (confusing) definitions: delete annex A.1.

SE Comment 16: annex A.1.5, note: some character encoding/decoding mismatch has occurred during the document processing.

SE Comment 17: annex A.1.11: "spacing character" usually has another meaning: all characters that are not non-spacing marks (like combining accents). Most characters are then "spacing". This definition is thus confusing, and needless.

SE Comment 18: annex A.1.6: The notion(s) of equivalence in Unicode are different from this (though apparently slightly related). This is confusing and needless.

SE Comment 19: annex A.2.2, second note: the proper handling of numerals is not as trivial as hinted here, see 14651 annex on collation preparation, subsection on numerals.

SE Comment 20: annex A.3: This subannex adds absolutely nothing to the (pre)standard, and should therefore be deleted.

SE Comment 21: page 35: could not be printed due to the massive amount of bitmapped data (several tens of megabytes). It is still likely to really contain text only. Similarly, some of the other pages (that could be printed, but with difficulty) contain large bitmaps, really with text (in tables) only.

SE Comment 22: annex A.8.3: editorial problems with the Cyrillic letters? There is no apparent correspondence between the columns.

SE Comment 23: annex D: the content here, together with all and any other text on preparation, e.g. of numerals and other things, should be collected, and somewhat expanded upon, in a special annex on preparation.

SE Comment 24: annex E: This presentation is definitely unfit for human review. Please use a suitable excerpt from the unicodedata3.0.0.txt file instead.

SE Comment 25: annex F: See also comments 3 and 4 above. The minimal tailoring MUST be the normative one, and gives quite different, and much better, results than that currently in clause 7.

SE Comment 26: annex F: The reordering of some of the second level weights has not been minimised. One should also investigate the possibility of completely harmonising 14651 and prENV 13710 on this point, resulting in no reordering of existing level 2 weights. MODIFIED and SUBSTITUTE should be replaced by VRNT1, MODIFIED2 should be replaced by VRNT2. (Probably STROKE can be replaced by VRNT1 also. Needs some double-checking though.)

SE Comment 27: annex F: The new handling of Cyrillic in 14651 is accepted, so there should be no repetition of it in prENV 13710.

SE Comment 28: annex F: (except possibly for STROKE), none of the lines saying "reorder-after" has any noticeable effect, and should be deleted.

SE Comment 29: annex F: The actual table entries don't have the same construction as in 14651. E.g.,

prENV 13710 (a): <U0111> <S705>;<STROKE>;<MIN>;<U0111>

prENV 13710 (b): <U00F0> <S705>;<MODIFIED>;<MIN>;<U00F0>

Should be (a): <U0111> <S0064>;<BASE><VRNT1>;<MIN><MIN>;<U0111>

Should be (b): <U00F0> <S0064>;<BASE><VRNT2>;<MIN><MIN>;<U00F0>
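The difference in construction is easier to see when the entries are split into their levels. The sketch below is a minimal parser for lines shaped like the four examples above, written in Python; it assumes only the syntax visible in those examples (a character symbol, whitespace, then levels separated by semicolons, each level a run of symbols in angle brackets). The real 14651 table syntax has more forms than this, and the names parse_entry, ENTRY and SYMBOL are illustrative.

```python
import re

# A character symbol, whitespace, then the weight specification.
ENTRY = re.compile(r"^(<[^>]+>)\s+(.+)$")
# One symbol in angle brackets, capturing the name inside.
SYMBOL = re.compile(r"<([^>]+)>")


def parse_entry(line):
    """Split an entry into (character symbol, list of per-level symbol lists)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised table entry: {line!r}")
    character = match.group(1)[1:-1]
    levels = [SYMBOL.findall(part) for part in match.group(2).split(";")]
    return character, levels


CURRENT = parse_entry("<U0111> <S705>;<STROKE>;<MIN>;<U0111>")
PROPOSED = parse_entry("<U0111> <S0064>;<BASE><VRNT1>;<MIN><MIN>;<U0111>")

for label, (character, levels) in (("current", CURRENT), ("proposed", PROPOSED)):
    print(label, character, [len(level) for level in levels], levels)
```

Parsed this way, the current entry carries one symbol at each level, whereas the proposed entry carries two symbols at levels 2 and 3, with the variant marker placed at level 2.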

SE Comment 30: general remark: If the SE comments on the tailoring (that prENV 13710 constitutes of the CTT) are handled properly, the resulting tailoring is most likely applicable unchanged for MES-3 and the full 10646 repertoire also. The resulting delta/tailoring will also be very small, essentially ONLY making a few Latin letters into variants of a corresponding base form. Having only a (properly constructed) minimal delta is very important, since many implementations will allow the nearly full 10646 repertoire. Therefore the current content of clause 7 is unacceptable, as seen in several other comments here, because it will give aberrant results if a collator based on it is fed text with anything outside of the MES-2. If something similar to the current content of clause 7 is kept (e.g. in an informative annex), which we think is a very bad idea, it must be very, very clearly stated that the results will be different from those given by the minimal tailoring, and that the ex-clause 7 tailoring is only applicable in the very rare event that the implementation is restricted to MES-2. We expect that next to no implementation will be restricted to exactly MES-2.
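The idea of a minimal delta can be sketched as a small set of overrides applied to a base table, leaving every other entry untouched. The Python below is illustrative only: the base entries and the symbol names S0111, S00F0 and S0561 are invented for the example, the two delta entries reuse the weights from the "Should be" lines in Comment 29, and apply_delta is a hypothetical helper rather than anything defined by 14651 or prENV 13710.

```python
def apply_delta(base, delta):
    """Return a copy of the base table with only the delta entries replaced."""
    tailored = dict(base)
    tailored.update(delta)
    return tailored


# Hypothetical excerpt of a base table: (level 1, level 2, level 3) symbols.
BASE_TABLE = {
    "d": (("S0064",), ("BASE",), ("MIN",)),
    "\u0111": (("S0111",), ("BASE",), ("MIN",)),  # d with stroke as its own letter
    "\u00f0": (("S00F0",), ("BASE",), ("MIN",)),  # eth as its own letter
    "\u0561": (("S0561",), ("BASE",), ("MIN",)),  # Armenian ayb, not part of the delta
}

# The delta turns two Latin letters into variants of d and does nothing else.
DELTA = {
    "\u0111": (("S0064",), ("BASE", "VRNT1"), ("MIN", "MIN")),
    "\u00f0": (("S0064",), ("BASE", "VRNT2"), ("MIN", "MIN")),
}

TAILORED = apply_delta(BASE_TABLE, DELTA)
CHANGED = sorted(ch for ch in TAILORED if TAILORED[ch] != BASE_TABLE[ch])
print([f"U+{ord(ch):04X}" for ch in CHANGED])
```

Because characters outside the delta keep their base entries, a collator built this way behaves like the base table for any wider repertoire, which is the property the comment asks for.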

UK COMMENTS ACCOMPANYING VOTE OF DISAPPROVAL ON prENV 13710 EUROPEAN ORDERING RULES

A. General points

The UK believes that default ordering using prENV 13710 and ISO/IEC FCD 14651 (of which prENV 13710 is a profile) should produce identical results (currently different results would be obtained) and that conformance issues should be more firmly addressed, as noted in points 1-2 of the UK's detailed comments.

The main reason for the differences is earlier decisions by CEN/TC304 on ordering of basic letters, accented letters, modified letters and digraphs, which have been superseded by more recent developments in ISO/IEC FCD 14651 and the Unicode Collation Algorithm. CEN/TC304 decisions made before those developments should be revised, and prENV 13710 amended accordingly, in the light of such international and industry standards.

In addition, there are other points relating to inclusion of contents, improvements in the Conformance clause, expansion of normative text, and amendments to the default table, which need improvement, to ensure compatibility between prENV 13710 and ISO/IEC FCD 14651.

The UK therefore reluctantly votes NO on the present draft. Accommodating these principles will change the UK vote to a YES vote.

B. Summary of detailed points

1. The relationship of prENV 13710 to ISO/IEC FCD 14651

ISO/IEC FCD 14651 and the closely related Unicode Collation Algorithm are likely to dominate ordering practice in Europe, and indeed worldwide. As a profile of ISO/IEC FCD 14651, prENV 13710 should more closely adhere to those related standards, to ensure that using any of these standards results in identically ordered data.

2. Contents and Conformance

A Contents list, and self-contained Conformance clause should be provided. The text should be revised to include the word "shall" in several instances.

3. Normative detail

The body of the standard (except the default table in section 7) is very slight. Some Annex text (particularly from Annex A) should be moved to the body of the standard, or annex text made normative.

Also text should enumerate the four levels of ordering:

Level 1: Digits, Latin, Greek, and Cyrillic letters, in that order [1]

Level 2: Diacritics [2]

Level 3: Case: upper case/lower case (or subscript etc., if relevant)

Level 4: Symbols/specials (not shown in the prENV 13710 default table)

4. Notation (prENV 13710 section 2)

The conventions and notations used in the default table, and any differences from comparable entries in ISO/IEC FCD 14651, should be explained in prENV 13710 section 2 (Notation).

5. Definitions (prENV 13710 section 4)

Additional definitions in informative Annex A should be combined with those in prENV 13710 section 4, and should also be consistent with those in ISO/IEC FCD 14651, and with other standards of ISO/IEC JTC1 and CEN/TC304.

Definitions should define, and not comprise enumerations or exceptions, as in A.1.1, A.1.3 (Note re Greek "F"); A.1.9; and A.1.10.

6. Tailorability (prENV 13710 section 6)

This section should be expanded. Brief text on specific deltas in Annex F might be useful. It should explain how symbols from ISO/IEC 10646-1 are ordered, whether they need to be listed in Section 7 itself or not.

7. Default table (prENV 13710 section 7)

Different orderings for Latin letters in prENV 13710 and ISO/IEC FCD 14651 must be avoided, if the default tables are used. Unaccented letters; accented letters; digraphs; and modified letters are ordered differently in each.

Neither ISO/IEC FCD 14651 nor the Unicode Collation Algorithm distinguishes between "basic" letters (like A, B) and "modified" letters (like ETH, EZH, THORN), nor do the Greek or Cyrillic sections of prENV 13710 (pages 5-8 of prENV 13710).

Earlier decisions in CEN/TC304, to include "Levels of letters", which have been superseded by later developments in ISO/IEC FCD 14651 and the Unicode Collation Algorithm, should be overridden. "Levels of letters" are not included in those standards.

LATIN LETTER KRA should also be ordered as modified Q, as in Danish and Greenlandic practice, not modified K.

8. Annex A definitions, and Levels

Annex A definitions should be moved to Section 4, and any differences resolved, particularly over "levels of letters", "special character" and "spacing character."

"Levels of comparison" as in ISO/IEC FCD 14651 should be added to prENV 13710.

9. Tables in Annex A

Tables should be simplified. Table 1 (Diacritical marks) should rely only on ISO/IEC 10646-1 names. Relationships in ordering single and multiple diacritics should be pointed out: Table 2 might be avoided.

For Table 3, section A.8.3 and the First level letter and Second level letter descriptions should be removed.

10. Annex B - Word by word ordering

Instructions for achieving word by word ordering should be more detailed.

11. Annexes C and D

Because prENV 13710 is a profile of ISO/IEC FCD 14651, these annexes should state that the ordering methods described are additional to the normal provisions of both ISO/IEC FCD 14651 and prENV 13710, and might achieve different results to the default table.

C. Detailed points

1. The relationship of prENV 13710 to ISO/IEC FCD 14651

ISO/IEC FCD 14651 and the closely related Unicode Collation Algorithm are likely to dominate ordering practice in Europe, and indeed worldwide.

The European Ordering Rules draft (prENV 13710) is explicitly a profile of ISO/IEC FCD 14651 - the foreword states that prENV 13710 provides 'a European default ordering table in the syntax of the forthcoming ordering standard ISO/IEC_FCD_14651.2 of which the present standard is a "profile".'

The UK therefore considers that default ordering of text or data which uses the repertoire described in prENV 13710 should produce IDENTICAL results whether users implement prENV 13710 or ISO/IEC FCD 14651 in ordering such data. Currently DIFFERENT results would be obtained in each case.

It is important that any differences between equivalent international and European standards are either eliminated, or minimised. Any differences should be fully justified and clearly flagged, in those cases where differences exist.

2. Contents and Conformance

A Contents list, and self-contained Conformance clause should be provided, if necessary using text from section 6 of ISO/IEC FCD 14651.

In the body of the standard, the text should be revised to include the word "shall" in several instances.

3. Normative detail

The body of the standard (except the default table in section 7) is very slight, and other explanations are left to informative annexes, and not to normative annexes, or to the body of the text. Some Annex text (particularly from Annex A) should be moved to the body of the standard, and/or one or more annexes be made normative, in order to remedy this.

The body of the standard should outline the four basic ordering levels that apply both in prENV 13710 and in ISO/IEC FCD 14651.

Some brief basic text is suggested below (indented):

prENV 13710 uses four levels of ordering.

Level 1: Digits, Latin, Greek, and Cyrillic letters, in that order [1]

Level 2: Diacritics [2]

Level 3: Case: upper case/lower case (or subscript etc., if relevant)

Level 4: Symbols (not shown in the prENV 13710 default table)

If two strings are identical at Level 1, only then is it necessary to compare strings at Level 2, then Level 3, then Level 4 etc., if necessary.

[1] Georgian and Armenian characters would be included in EOR-2.

Non-European scripts are not included in prENV 13710.

[2] For Latin and Greek accented characters, diacritics have significance only at level 2 (following user expectations among the majority of users of Latin script languages in Europe, and users of Greek script). In Cyrillic script, by comparison, accented characters are treated as significant at level 1, (following user expectations in Cyrillic script languages in Europe).
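The level-by-level comparison in the suggested text can be sketched as a multi-level sort key. The Python below is a rough illustration under stated assumptions, not an implementation of prENV 13710 or ISO/IEC FCD 14651: it uses code point order as a stand-in for real level 1 weights, derives diacritics from Unicode canonical decomposition, places lower case before upper case, and treats every non-alphanumeric character as a level 4 symbol. It also treats Cyrillic accents at level 2, so it does not follow note [2] for Cyrillic.

```python
import unicodedata


def sort_key(text):
    """Return a four-level key; later levels only matter if earlier ones tie."""
    level1, level2, level3, level4 = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: attach the diacritic to the most recent letter, if any.
            if level2:
                level2[-1] += (ord(ch),)
        elif ch.isalnum():
            level1.append(ch.lower())               # Level 1: digit or base letter
            level2.append(())                       # Level 2: no diacritics yet
            level3.append(1 if ch.isupper() else 0) # Level 3: lower before upper
        else:
            # Level 4: symbols are ignored above and recorded with their position.
            level4.append((len(level1), ord(ch)))
    return (level1, level2, level3, level4)


WORDS = ["résumé", "Resume", "resumes", "resume"]
ORDERED = sorted(WORDS, key=sort_key)
print(ORDERED)
```

In this sketch "resume", "Resume" and "résumé" tie at level 1 and are separated by levels 3 and 2 respectively, while "resumes" differs at level 1 and therefore sorts after all three regardless of accents or case.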

4. Notation (prENV 13710 section 2)

In the default table (prENV 13710 section 7) there are different conventions for the entries than are used in ISO/IEC FCD 14651. The conventions and notations used, and differences from comparable entries in ISO/IEC FCD 14651, should be explained in prENV 13710 section 2.

5. Definitions (prENV 13710 section 4)

The additional definitions in informative Annex A should be combined with those in prENV 13710 section 4. They should be consistent with those in ISO/IEC FCD 14651, and also with other standards of ISO/IEC JTC1 and CEN/TC304 - note definitions used in the CEN/TC304 Fallback draft.

Definitions should be definitions, and not enumerations (as in A.1.1, A.1.3) although enumerations could be provided as examples.

Definitions should also state what things are, not what they are not, as in A.1.3 (Note re Greek "F"); A.1.9; and A.1.10.

6. Tailorability (prENV 13710 section 6)

This section is very brief, and should provide text on how tailoring might work. prENV 13710 also lists some deltas in Annex F, to accommodate ordering in some specific national language traditions. Section 6 (Tailorability) should include some brief text about these tailorings.

Section 6 of prENV 13710 should specify explicitly how symbols are ordered, even if they are not listed in Section 7 itself.

7. Default table (prENV 13710 section 7)

Earlier decisions in CEN/TC304, to include "Levels of letters" as well as "Levels of comparison", superseded by more recent decisions in ISO/IEC FCD 14651 and the Unicode Collation Algorithm, should be overridden. "Levels of letters" are not included in those standards.

This will avoid the current different ordering for Latin letters in prENV 13710 and ISO/IEC FCD 14651.

ISO/IEC FCD 14651 and prENV 13710 differ in the relative placement of unaccented letters; accented letters; digraphs; and modified letters. prENV 13710 should be amended so that digraphs precede modified letters.

LATIN LETTER KRA should be ordered as modified Q, as in the Danish and Greenlandic practice, and as in ISO/IEC FCD 14651 and the Unicode Collation Algorithm.

8. Annex A: Levels (see also section 7.1 above on Levels)

Annex A definitions should be moved to Section 4, as noted above. Variations in usage between definitions of "special character" and "spacing character" in Annex A and Annex B should be resolved, and consistent use and terminology applied throughout the standard.

"Levels of comparison" as in ISO/IEC FCD 14651 should be added to prENV 13710. "Levels of letters" is now an unnecessary distinction (not used in ISO/IEC FCD 14651 or the Unicode Collation Algorithm) and should be removed. Sections A.1.3, A.1.7, and A.8.3, A.4.2.2 Second level letters (within Level 1); A.5 Second level letters; and A.5.3.1 Second level letters (within Level 2), should also be removed, as they are likely to cause unnecessary confusion, through this unsought "Levels of letters" distinction.

9. Tables in Annex A

In Table 1, - Diacritical marks - alternative names should be removed, and ISO/IEC 10646-1 names should be relied upon. In Notes 15/16, unification applies to ordering unification, not character unification: this should be made clear. Notes 15 and 16 also imply a deviation from the small before capitals convention: these entries should be reversed.

Tables 1 - 2 should be on the same page, to enable comparison. It should be pointed out that multiple diacritics derive their ordering from that of single diacritics, which could even avoid the need for Table 2.

In Table 3 - Second level letters - section A.8.3 and the notion of First level letter and Second level letters should be removed, following ISO/IEC FCD 14651 and the Unicode Collation Algorithm.

10. Annex B

Annex B describes the results of word by word ordering. However, the instructions for how to achieve these results are insufficiently detailed.

11. Annexes C and D

Because prENV 13710 is explicitly a profile of ISO/IEC FCD 14651, it should be explicitly pointed out (in the Annexes and in the conformance clause) that the ordering methods described in Annexes C and D are in addition to the normal provisions of both ISO/IEC FCD 14651 and prENV 13710.

These are essentially post-processing/re-ordering operations, following the completion of a conformant ordering operation, and may lead to a different ordering from that produced by applications which conform solely to the normative parts of prENV 13710.

Annex C: Implementations of ISO/IEC FCD 14651 and prENV 13710 can still be conformant to the normative part of ISO/IEC FCD 14651 or prENV 13710, if Ordering by position and style, described in Annex C, is adopted, as it essentially provides a 5th level of ordering, on strings that are otherwise identical through levels 1-4.

Annex D: Implementations of ISO/IEC FCD 14651 and prENV 13710 which follow Annex D - Implicit or explicit transliteration - result in strings that are different to the normative part of ISO/IEC FCD 14651 or prENV 13710, as additional or replacement characters are inserted at the beginning of each string.
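The point about Annex C can be sketched as an extra tie-breaker appended to an otherwise unchanged key. The Python below is a hypothetical illustration: conformant_key is only a stand-in for a real levels 1-4 key, and the Entry type and its style_rank field are invented for the example and are not taken from Annex C.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    text: str
    style_rank: int  # hypothetical: 0 = plain, 1 = italic, 2 = bold


def conformant_key(text):
    # Stand-in only; casefold() is not the prENV 13710 or 14651 ordering.
    return text.casefold()


def extended_key(entry):
    """Levels 1-4 first, then the additional fifth-level tie-breaker."""
    return (conformant_key(entry.text), entry.style_rank)


ENTRIES = [Entry("beta", 2), Entry("alpha", 1), Entry("beta", 0), Entry("alpha", 0)]

BY_LEVELS_ONLY = sorted(ENTRIES, key=lambda entry: conformant_key(entry.text))
WITH_EXTRA_LEVEL = sorted(ENTRIES, key=extended_key)

print([entry.text for entry in BY_LEVELS_ONLY])
print(WITH_EXTRA_LEVEL)
```

The sequence of texts is the same in both results; the extra level only decides the order among entries whose conformant keys are equal.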

Þorgeir thorgeir@stri.is +354 520 7156