More comments on the CEN guide, v4

L2/99-375

SC22/WG20 N729

From: Kenneth Whistler

Date: Tuesday, November 30, 1999

Subject: More comments on CEN TRnnnn:1999 Guide

Since I sent around my last set of comments, which Arnold made into an L2 and WG20 document, my sources have made available Draft 4 of the Guide, dated 1999-11-22. This version has been somewhat updated from Draft 3, and also includes the two Annexes that I could not find for Draft 3.

Some of the minor issues I raised before have been addressed in the latest draft, although most of the big issues have not. And there are a couple of new problems that have been introduced.

Herewith my remaining comments on the Guide itself, plus a few comments on the Annexes—particularly Annex B, which deals with the UCS.

--Ken

=============================================================

Additional Comments on CEN TRnnnn:1999 Guide to the use of character set standards in Europe.

The following comments apply to Draft 4 of this document, including the Annexes, and should be considered as supplementary to the previous set of comments from L2.

Table of Contents

There is a minor problem in the generation of the Table of Contents. Sections 5.3 and 6.1 are omitted.

Section 6.2.2 Character classification

This new section baldly states that the “main specification in ISO [of the classification of characters] is ISO/IEC TR 14652...” This statement neglects the fact that the section of 14652 in question is tremendously controversial within the relevant ISO committee (SC22/WG20), with serious and principled dissension among several NB members of that committee. The LC_CTYPE specification in 14652 is one of the main reasons why 14652 was, in fact, moved from being a standard in development to being a TR in development, since no consensus could be obtained on how to proceed with that portion of the document. Furthermore, 14652 has not yet progressed beyond the PDTR stage, with resolution of ballot comments still under way.

This section mentions, in an offhand way, the fact that the Unicode Consortium has also specified character classifications, but neglects to point out that those classifications are in de facto use by nearly all Unicode implementations—which means the vast majority of 10646 implementations.

It is a disservice to European procurers to be pointing them at one controversial (and not yet completed) set of character classifications being flogged, in particular, by the Norwegian committee for use in Linux, just because it has an ISO label on it, while not pointing out the actual usage that is being made of the Unicode character classifications for all other major OS’s (Microsoft, Apple, IBM, ...), major applications, and all major databases, as well as Java.
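As an illustration of what the Unicode character classifications look like to an implementer, here is a minimal sketch in Python, whose standard unicodedata module exposes the General Category values of the Unicode Character Database. The helper name classify and the sample characters are chosen only for illustration; nothing here is taken from the Guide or from 14652.

```python
import unicodedata

def classify(text):
    """Return (character, General Category) pairs for each character in text."""
    return [(ch, unicodedata.category(ch)) for ch in text]

# Illustrative sample: Latin capital, Latin small letter with acute,
# Greek small alpha, a decimal digit, a space, and the euro sign.
sample = ["A", "\u00e9", "\u03b1", "5", " ", "\u20ac"]

for ch, cat in classify(sample):
    print(f"U+{ord(ch):04X} {cat}")
```

The two-letter values (Lu, Ll, Nd, Zs, Sc, and so on) are the kind of classification data that an implementation consults when it has to decide whether a character is a letter, a digit, or a symbol.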

Section 13.1 ISO/IEC standards

ISO/IEC 14652 has been added to this section as a “DIS”, which it is not. It is currently a PDTR, and even assuming it passes its next ballot (sometime next summer?), it will be a TR, and not a standard.

ISO/IEC 15435 has also been added to this section. That document has been through three working drafts now, and is still in terrible shape, with no committee consensus to move it to CD status. It should certainly be removed from the list of references from this Guide. It has no status or use for procurers at this point.

Annex A: 8-Bit Character Sets

In general, this Annex is an admirable attempt to explain the entire context of ISO/IEC 2022 and the standards associated with it. It is certainly the clearest and most coherent summary of how all these standards fit together that has come to our attention.

However, in the context of a Guide for procurement, Annex A is somewhat misleading in not sufficiently identifying those parts of the entire 2022 framework that are not widely implemented or not implemented at all. Effectively, only the series of 8859 parts have seen very widespread, consensus implementation. All other aspects of this framework for 8-bit character sets have niche implementations, from the point of view of the main directions in the IT industry. Some of these standards, such as ISO/IEC 10538, are completely failed standards. And much of the 2022 framework, designed for use in OSI, is now generally ignored.

The Guide would be well-served if the judgement from section 8.3.1, “The UCS represents the future direction for coded character sets and is being implemented by a wide range of suppliers,” were carried forward into a judicious assessment of which parts of the 2022 framework are already obsolescent and destined to marginal legacy support. As it stands, the Annex is much too “kind” to most of these existing standards, neutrally describing the complexities of the CONTROL FUNCTION INTRODUCER and IDENTIFY GRAPHIC SUBREPERTOIRE, for example, without the technical guidance of Section 5 truly taking a stand about whether this kind of stuff should be avoided.

Section 6.1.3 ISO 2375

This section should probably point out that a revision of 2375 is currently under ballot. This revision will likely modify the requirements for registration somewhat.

Section 6.1.11.3 Tutorial guidance for ISO/IEC 10646 makes the statement, “In its [10646] full form a four-octet coding (32 bits) will be required,...” This is somewhat ambiguous, depending on what is meant by “full form”, but basically, 10646 does not require a four-octet coding; it allows a four-octet coding. The wording here should be modified to avoid giving a misleading impression regarding what 10646 requires.
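The difference between "requires" and "allows" can be seen in a small sketch, given here in Python. It assumes that Python's utf-32-be codec is an acceptable stand-in for the four-octet form and utf-16-be for the two-octet form of BMP characters; the helper name encoded_lengths is hypothetical.

```python
def encoded_lengths(ch):
    """Return the number of octets used for ch in three encoding forms."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "two-octet": len(ch.encode("utf-16-be")),   # stand-in for UCS-2 (BMP only)
        "four-octet": len(ch.encode("utf-32-be")),  # stand-in for UCS-4
    }

# LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE, EURO SIGN
for ch in ("A", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The same characters are representable in each form; the four-octet coding is one permitted choice among several, not a requirement.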

Annex B: The Universal Character Set (UCS)

This section also provides a rather thorough and up-to-date summary of 10646. There are, however, a few discrepancies and errors in the text that should be corrected.

Section 1.2 The UCS and UNICODE [sic]

As elsewhere in the Guide, the term “Unicode” should be written that way, and never in all caps. It is not an acronym, but a trademarked name. The first citation should show the “TM” symbol, but thereafter in the document, the “TM” can be left off. Please refer to:

http://www.unicode.org/unicode/consortium/logo.html

for information about the proper usage of the mark.

The last paragraph of this section refers twice to “14652”. Presumably the first citation is intended to be a citation of “14651”, the ordering standard.

Section 2.1 Characters, character names and glyphs

In the third from the last paragraph, there is a claim that digits “may only be used in the names of ideographic characters”. Actually, the name guidelines are a little looser than that, and there are four punctuation character names that have a “9” in their names.
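The names in question can be checked directly. The sketch below, in Python, looks up four quotation marks in the General Punctuation block; that these are the four names meant by the comment above is an assumption on the part of this example.

```python
import unicodedata

# Four quotation marks in the General Punctuation block (assumed to be the
# four punctuation characters referred to in the comment).
QUOTATION_MARKS = ("\u201a", "\u201b", "\u201e", "\u201f")

names = [unicodedata.name(ch) for ch in QUOTATION_MARKS]

for ch, name in zip(QUOTATION_MARKS, names):
    print(f"U+{ord(ch):04X} {name}")
```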

Section 2.2 Graphic characters and control characters.

In referring to protocols that “separate the control data from the character data”, this section uses ASN.1 as an example. That is not a very good example. What this section is talking about are what are generally called markup languages or document description languages, such as HTML and PDF. Those two would be far more appropriate examples to use—particularly since they are used and sanctioned by ISO and CEN to post their own documents!

Section 2.3 Alphabetic, syllabic and ideographic scripts

There are some errors here, where the text has not caught up with the second edition of 10646-1, or the CD for 10646-2.

“...somewhat over one quarter of its code space in the BMP, a total of 20992 code positions,” ==> “...somewhat over 42% of its code space in the BMP, a total of 27,484 code positions,”

“This would leave somewhat over one half of the code space of the BMP...” ==> “...somewhat less than one half ...”

“is allocating new scripts and ideographic characters to the second plane of the code space.” ==> “ ... to the first and second planes of the code space.”

Section 2.4 Sequence order and writing mode.

The first paragraph claims there are 3 arrangements in common use. While this may be true, depending on what is meant by “common”, the text neglects to point out that the standard practice for Mongolian (now encoded in 10646-1, second edition) is top-to-bottom layout with successive rows being written left-to-right (the opposite of vertical CJK layout).

The second paragraph talks about bidirectional layout (without actually calling it that), and emphasizes the 6429 directional controls (SELECT PRESENTATION DIRECTIONS, START REVERSED STRING), without sufficiently deprecating their use. It points out that “10646-1 specified in annex F alternative ways of managing the direction of presentation with embedded characters and these are preferred,” while neglecting to note that those embedded characters are intended for use with an implicit bidirectional algorithm that can work without any such embedded controls (except for overriding behavior in exceptional circumstances). The appropriate document to point to here is, of course, the Unicode Standard, which details the bidirectional algorithm that is actually being used in shipping systems today to support Hebrew and Arabic in Unicode (=10646) implementations.
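What "implicit" means here is that each character carries its own directional property, so that ordinary text needs no embedded controls. The following Python sketch only reads those properties from the standard unicodedata module; it does not implement the bidirectional algorithm itself, and the helper name bidi_classes is hypothetical.

```python
import unicodedata

def bidi_classes(chars):
    """Return the bidirectional class of each character in chars."""
    return [unicodedata.bidirectional(ch) for ch in chars]

sample = [
    "a",        # LATIN SMALL LETTER A
    "\u05d0",   # HEBREW LETTER ALEF
    "\u0627",   # ARABIC LETTER ALEF
    "1",        # DIGIT ONE
    "\u202e",   # RIGHT-TO-LEFT OVERRIDE (an explicit embedded control)
]

for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```

The letters and the digit already carry the classes an implicit algorithm works from (L, R, AL, EN); the override character is the kind of embedded control needed only in exceptional cases.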

Section 3.1 Uniform and mixed codes.

The statement that “the combining characters of the UCS are characters in their own right but they do not have a visual representation on their own,” is incorrect. It is true that they are characters. But they also have associated representative glyphs, as for most other graphical characters. The glyphs may or may not be distinct entities in fonts, but they often are. The only difference for most of the non-spacing marks is that their representative glyphs have a large negative side-bearing. But that isn’t even true of many of the combining characters for Indic, which may be fully spacing in display.
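The distinction between non-spacing and spacing combining characters is visible in the character properties. A minimal Python sketch follows, with two example characters picked for illustration (the helper name describe is hypothetical):

```python
import unicodedata

def describe(ch):
    """Return (name, General Category, canonical combining class) for ch."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

acute = describe("\u0301")      # a non-spacing mark used with Latin
vowel_aa = describe("\u093e")   # a spacing vowel sign used in Devanagari

print(acute)
print(vowel_aa)
```

Category Mn marks a non-spacing mark and Mc a spacing combining mark; both are graphic characters with representative glyphs in the code charts.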

Section 3.4 The four-octet structure of the UCS.

“...the commercially-developed UNICODE™, which was developed strictly as a two-octet code.”

Correct the name to “Unicode”, and insert “originally” in front of “developed”. Now that the Unicode Standard officially sanctions both UTF-16 and UTF-8 forms, it is now neither strictly fixed-width nor strictly two-octet.

Section 4.5 The Hangul syllabics and Yi

In the third paragraph, insert the missing word “Yi” in front of “syllables.”

Section 4.6 The remaining rows of the BMP

In the discussion of private use, the statement, “...but the block for private use consists, by its very nature, only of unallocated code positions,” is somewhat misleading. The term “allocate” is used in this Annex to imply the existence of an encoded character at that code position, but by designating a code position as private use, the standard is effectively allocating a function to that position, though not a particular character identity. Part of the problem with using the term “unallocated” in this way is that it could imply that a character could be “allocated” there by the standards committee in the future. But that is exactly what the users of private use code positions do not want to have happen, since it would interfere with their private use.
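The difference between a private use position and a merely unassigned one shows up in the General Category property. In the Python sketch below, U+0378 is used as an example of an unassigned position; that is an assumption that holds for the character database bundled with Python at the time of writing and could change in a later version.

```python
import unicodedata

def status(cp):
    """Return the General Category of the code position cp."""
    return unicodedata.category(chr(cp))

# First and last positions of the BMP private use area.
private_first = status(0xE000)
private_last = status(0xF8FF)

# Assumed to be unassigned in the bundled database (see lead-in).
unassigned_example = status(0x0378)

print(private_first, private_last, unassigned_example)
```

Private use positions report Co, a designation of their own, and not the Cn value that is reported for positions with no assignment at all.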

Section 5.1 Combining and non-combining characters

The statement in the fourth paragraph, “Combining characters are not an essential part of the coding of the Latin script,” is just wrong. They are an essential part of the coding. They are present in 10646 exactly to deal with all the extensions and diacritic usages in the Latin script that cannot be handled by enumerating a closed repertoire of precomposed Latin characters. This paragraph and the following discussion are trying to argue a position about what characters SC2/WG2 should encode—a position that is neither a consensus position within WG2 nor really in the scope of this Annex’s presentation of 10646. The argumentation should be removed.
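One way to see the open-ended nature of Latin diacritic usage is through Unicode normalization, which the Annex does not discuss and which is brought in here only for illustration. In the Python sketch below, the two base-plus-mark pairs are arbitrary examples and the helper name nfc_length is hypothetical.

```python
import unicodedata

def nfc_length(base, mark):
    """Return the number of code positions in the composed (NFC) form of base + mark."""
    return len(unicodedata.normalize("NFC", base + mark))

# e + COMBINING ACUTE ACCENT: a precomposed character exists.
with_precomposed = nfc_length("e", "\u0301")

# q + COMBINING TILDE: no precomposed character exists, so the
# combining character remains in the data.
without_precomposed = nfc_length("q", "\u0303")

print(with_precomposed, without_precomposed)
```

The second case can only be represented with a combining character, which is the situation a closed repertoire of precomposed Latin characters cannot cover.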

In the seventh paragraph, the statement, “10646-1 defines three distinct levels of implementation in which either none (level 1), or some (level 2), or all (level 3) of the combining characters of the UCS are permitted to be encoded,” is wrong. The combining characters of the UCS are encoded. What the levels represent are constraints on the use of certain sets of combining characters in data that purports to be conformant at that level.

In the last paragraph, the statement, “It is the general intention of the UCS that for most purposes it will not be necessary to use a level 3 implementation but that the choice between levels 1 and 2 will depend on the character collections to be used,” is also wrong. The UCS does not have intentions. It was the intent of certain of the drafters of the text of 10646 that the mechanism of levels would make it easier to specify implementation levels that did not require the use of combining marks, since they considered combining marks complicated and a barrier to implementation. However, if one looks at actual claims of conformance to 10646, the overwhelmingly significant fact is that the Unicode Standard claims conformance to 10646 with implementation level 3, and nearly all implementations claim conformance to the Unicode Standard. This has not been a barrier to implementations, in practice.

Section 6.2 Naming guidelines of the UCS

The entire section regarding naming problems (LATIN CAPITAL LETTER AE, MUSIC NOTE versus EIGHTH NOTE, LATIN SMALL LETTER G WITH CEDILLA, etc.) is an argumentative rehashing of what ought to be dead issues of no great import. Continuing these arguments serves little purpose. But if the arguments have to be continued, then the claim (regarding the change from LIGATURE AE --> LETTER AE) that “the Technical Corrigendum changed the characters allocated to the six code positions affected, it did not rename six characters,” is just utter baloney. No one in the implementing industry treated this as a change in encoded characters. No mappings changed, no properties changed, no implementations changed—only the names changed in a few central databases.

Section 7.2 Collections and subsets

In the seventh paragraph, the statement that “it is much more concise to list collections rather than individual characters,” should be balanced by the caution that many of the collections listed in Annex A of 10646-1 are not coherent collections for actual implementations. For most implementations of scripts, a more precise control over the list of supported characters is needed in order to provide the end-users with the range of textual behavior they demand.

Section 8.4 UTF-16

In the second paragraph, the claim that “UTF-16 has been designed to avoid this halving of capacity,” is rewriting history somewhat. UTF-16 was designed as a code extension mechanism for 16-bit implementations of 10646. If halving the data transfer rate through a communication link or for storage were the main issue, that could have been addressed easily with compression mechanisms, rather than new encoding forms. The real issue was the cost of changing the width of the character datatype for existing API’s and implementations.
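The extension mechanism itself is simple arithmetic on 16-bit code units, which is why it fits existing 16-bit datatypes. A minimal Python sketch of the mapping from a code position beyond the BMP to a pair of 16-bit units follows; the function name to_surrogates and the sample code position are chosen for illustration only.

```python
def to_surrogates(cp):
    """Map a code position in 0x10000..0x10FFFF to a (high, low) pair of 16-bit units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code position is not in the range reached by pairs")
    offset = cp - 0x10000                # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1D11E).encode("utf-16-be")
print(encoded.hex())
```

An implementation whose character datatype is 16 bits wide can carry such pairs without any change to the width of that datatype.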