November 23, 1999
Comments on Draft 2 for CEN TRnnnn:1999,
Guide to the use of character set standards in Europe
From: Kenneth Whistler [email@example.com]
This CEN Technical Report is aimed primarily at providing technical guidance to procurement officers in matters related to character sets in IT products. As such, it has a potentially large impact on the wording and content of future procurement specifications originating in Europe, especially since it provides very explicit suggestions for wording related to various circumstances. It is quite likely that such wording will end up verbatim in procurement specifications.
Because of this, it is important that the guidelines provide accurate guidance -- both regarding the technical content and context of the applicable character set standards, and in terms of meaningful boilerplate language for procurement specifications that will lead to procurement of what is intended, rather than to protracted haggling over terminology and/or to misleading responses by vendors in their bids.
While the current draft of the Guide does generally point to the Universal Character Set (UCS), i.e. ISO/IEC 10646 and its sister standard, the Unicode Standard, as the direction that character set standards are headed, and does suggest that procurers take that into account, we believe that the Guide underestimates the extent to which the IT industry has already taken that direction. By not making that situation clear to procurers, it does them a disservice in placing too much emphasis on a character architecture (2022 and its kin) that is basically unsupported by the IT industry. The Guide should make a clear distinction between what is frozen in the past, and where implementations are headed, so that procurers can be in no doubt regarding the exceptional situations where they need to procure IT products compatible with legacy character sets, and where they need to procure IT products that will work into a future in which Unicode becomes ubiquitous for character set interoperability.
The remainder of these comments focus on particular problems in the text of the draft. We believe that these problems should be addressed, so that the Guide may become truly useful, and so that it does not become merely a source of miscommunication and confusion between procurers and vendors.
1. page 1
(and throughout) refers to two annexes which promise to provide "much more detailed, tutorial information." It is unclear where those annexes are -- they do not appear to be part of the document in question.
2. Last paragraph of section 1 Introduction.
This paragraph points to other documents for further information on character sets and their standardization, and in particular to John Clews' "Language automation world-wide" and to Indrek Hein's website. If such sources are included -- especially a pointer to a website -- it would seem self-defeating at best not also to include a reference to http://www.unicode.org, which contains a wealth of information about the Unicode Standard and its relation to ISO/IEC 10646, as well as a whole series of technical reports which have a bearing on the questions at hand. In particular, Unicode Technical Report #17, "Character Encoding Model", has a far more complete and precise specification of the issues and distinctions to be made regarding character set standards than the summary provided in Section 5 of this Guide.
3. Section 3, Scope and field of application,
… points to a major hole in this Guide. "...thus character set standards for non-European languages are not covered." When examined in detail, this means, among other things, that none of the subsets specified and recommended for procurement cover Hebrew or Arabic, while some of them *do* cover Georgian and Armenian. By any reasonable measure, Arabic is a far more important script in France, for example, than is Georgian. IT procurement for France (and other countries in Europe) may well have to take Arabic into account in order to accomplish their intended purposes. But this Guide is completely silent about what procurers should do in such situations. In fact, the entire TC304 approach to the definition of the MES repertoires is very "head-in-the-sand" regarding the place and importance of Middle Eastern scripts in a European IT context, as opposed to Caucasian scripts -- which appear to have been tossed in merely to make the Caucasians (as opposed to the Semites) honorary Europeans, and because the Armenian and Georgian scripts were deemed to be more tractable for European IT equipment.
4. Note under definition for control character:
"strictly spoken" --> "strictly speaking".
5. Definition of code page.
Code page is sometimes used as a synonym for code table, but should not be *defined* as a synonym. A code page in the IBM CDRA architecture is the mapping of an abstract character repertoire (the CS, in CDRA terms) to a set of code points. That can be, and often is, done in terms of a database table that has nothing to do with visual representation of the glyphs for characters. It is mostly a matter of convention that code pages are often displayed in code tables. The two are not, definitionally, the same thing.
6. Note to the definition for transliteration.
"In principle, a transliteration should be a one-to-one conversion." This statement is not true. Many transliterations (including those proposed by TC304 in its own fallback tables for rendering of Greek and Cyrillic characters) are one-to-many. And there are also instances where they should be many-to-one. (Polish transliterated in Cyrillic is such a case, where Polish digraphs should be transliterated by single Cyrillic characters.) What may be intended here instead is that transliterations, in principle, should be *reversible* -- that is, that the original form should be obtainable by reversing the transliteration rules. Of course, even this is not true for many actual transliteration systems, which may neutralize distinctions that cannot be recovered by reversing the rules.
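The reversibility point can be illustrated with a short sketch in Python (the two-entry mapping is a hypothetical fragment for illustration, not the TC304 fallback tables):

```python
# Hypothetical transliteration fragment: one-to-many, yet still
# reversible as long as no output sequence is ambiguous.
GREEK_TO_LATIN = {"\u03c8": "ps", "\u03c6": "f"}  # psi -> "ps", phi -> "f"

def transliterate(text, table):
    """Apply a character-to-string transliteration table."""
    return "".join(table.get(ch, ch) for ch in text)

# One-to-many, but the original is recoverable from the output...
assert transliterate("\u03c8\u03c6", GREEK_TO_LATIN) == "psf"
# ...until further rules (e.g. pi -> "p", sigma -> "s") would make "ps"
# ambiguous, at which point the scheme is one-to-many AND non-reversible.
```

The property that matters for the Guide is thus reversibility of the rule system as a whole, not a one-to-one shape for individual rules.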
7. Definition for fall-back.
The term "output character" is not defined, and is problematical. As ISO/IEC TR 15285 (An operational model for characters and glyphs, cited in Section 5) should make clear, the operational unit for output display (or "rendering") of text is the *glyph*, not the *character*. Using the term "output character", particularly without sufficient clarification, blurs that distinction unnecessarily and creates ambiguities in the guidance for procurement at the "Output" end of the model this TR uses. (More on this topic below.)
8. Definitions for diacritic and for glyph.
Better, more complete definitions of these terms are available in the Glossary of the Unicode Standard.
9. Section 5.2 Coding.
This starts off with the incorrect statement: "In IT systems a character is represented by a 7-bit or 8-bit combination, usually expressed as a numeric code." Since IT systems now also make use of 10646 and Unicode, which represent characters by 16-bit or 32-bit values, the erroneous statement should be amended. See the Unicode Technical Report #17 for a more complete and accurate statement about character encoding.
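As a minimal illustration (the choice of Python here is incidental), the EURO SIGN occupies one 16-bit code unit in UTF-16 and three 8-bit units in UTF-8 -- neither fits the "7-bit or 8-bit combination" model:

```python
# U+20AC EURO SIGN in two UCS encoding forms.
euro = "\u20ac"
utf16 = euro.encode("utf-16-be")  # one 16-bit code unit
utf8 = euro.encode("utf-8")       # three 8-bit code units

assert utf16 == b"\x20\xac"
assert utf8 == b"\xe2\x82\xac"
```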
And the very last paragraph of Section 5.2, which points out that modern IT systems no longer have the earlier restrictions in size, should *explicitly* mention 10646 and Unicode at this point in the discussion. While it is true that legacy character encodings will remain with us for many years, it is already the case that new solutions are starting to encapsulate them all and treat them as alternate, marginal representations in special contexts, rather than as structural backbones for IT configurations. This section should also make a clear distinction for procurers between the longevity of *data* and the longevity of *software* that handles that data. The data will be with us for decades, but turnover in software tends to be much faster. And even in those circumstances where legacy software has a long lifetime (as in mission-critical applications running on mainframe computers), it tends to get "encysted" by layered software built on top of it which isolates and protects the rest of distributed, interoperating systems from the peculiarities of the older system.
10. Section 6.1 The input function.
This Guide does not sufficiently distinguish the character repertoire required for input from the input method(s) used for input. The problem is perhaps not so visible for Europe, since most input specification is handled by referring to keyboard standards. But in general the issue is a serious one -- most obviously visible in Asia, where there may be many distinct input methods applicable to the same input repertoire, and vice versa. The reason why even a Guide for European procurement should take this distinction clearly into account is that for the larger European character repertoires, simply extending existing keyboard standards is not going to be sufficient. To deal with MES-3, for example, IT procurers should be aware of, and be specifying, alternative methods of character input -- and not merely be assuming that whatever generic fallback method exists for inputting an arbitrary Unicode character into a system will be appropriate and sufficient for the users of the technology they are acquiring.
11. Section 6.2 The processing function.
"In addition, other information may be associated with each character such as colour, emphasis level and font,..." This refers to an archaic method of handling character attributes (associated with particular terminals and with related early DOS memory-mapped character displays) that has been completely wiped away by the advent of bit-mapped rasterizing displays and modern text-processing models. It would be advisable to merely point out that processing codes for the representation of characters may be distinct from interchange codes, without using inappropriately archaic examples.
Furthermore, it is incorrect to state that "most commercially available computer systems do not use standardized character sets for internal representation of character data." Microsoft Windows, as well as many systems from IBM, Apple, and other major software vendors, now make *extensive* internal use of Unicode as their processing codes. Since the Unicode Standard is conformant to ISO/IEC 10646, this internal processing usage is indistinguishable from a use of 10646 as an internal processing code. Even Unix systems are now making extensive use of Unicode (in its UTF-8 form, or expressed as a processing code indistinguishable from the UCS-4 form of 10646) for internal representations. And where the Unix systems make use of 8-bit character sets, they tend to use the ISO 8859 series of standards now for processing, rather than proprietary vendor specifications.
12. Section 6.3 The interchange function.
The statement that "Thus a character set for interchange is needed which will have to be different from one or both of the processing character sets," is also generally not the case. This follows for two reasons. First, many vendors, as just pointed out, *are* making use of official standards for their processing code. This is most obviously true for implementations making use of Unicode, in which Unicode is the processing code, and then is also used as the interchange code. But this also follows because there are numerous protocols for interchange that allow identification and standard interchange of data using vendor character sets, as well as of data using de jure standards. The most obvious of these include MIME tags for mail interchange, and the charset tags for HTML. Use any browser and check the character set options. They will include a number of vendor standard character sets that are used worldwide for the exchange of HTML pages. Windows character sets are often in more widespread use for open data interchange than many international character encoding standards. And we are not talking about "internal (proprietary) networks" (cf. Section 7) but the Internet. You cannot get much more open and universal a network than that!
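A sketch of the kind of labelling involved (the header values are illustrative): vendor charsets and de jure charsets are equally first-class labels in MIME and HTML interchange, and for the repertoire they share, windows-1252 and iso-8859-1 even produce identical bytes:

```python
# Illustrative charset labels of the kind seen in open Internet
# interchange (example values, not taken from any particular server).
mime_header = "Content-Type: text/plain; charset=windows-1252"
html_meta = ('<meta http-equiv="Content-Type" '
             'content="text/html; charset=iso-8859-1">')

# For characters in the shared repertoire, the vendor character set
# and the de jure standard encode identically:
assert "\u00e9".encode("windows-1252") == b"\xe9"   # e-acute
assert "\u00e9".encode("iso-8859-1") == b"\xe9"
```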
12a. The last paragraph of Section 6.3:
"Such requirements may also include policy decisions of a more technical nature, e.g. as to what code structure to use." This is all rather vague, and does not actually provide much guidance to a procurer. It would be far better to suggest use of Unicode (or 10646, if you prefer), with the choice between UTF-8 and UTF-16 made sensitive to the nature of the application(s) and their interfaces. That would more closely match the actual decisions that application procurers have to make.
13. Section 6.4 The output function.
This section again makes a misleading reference to "output character sets", and refers to the outdated model of terminal and memory-mapped character displays for characters. What is actually needed at this point is some guidance for procurers regarding the procurement of *fonts* that are sufficient for appropriate rendering of the character repertoire desired for processing and interchange. This should be clearly stated in terms of fonts (rather than the "output function of characters"), since the real procurement issues that will need to be decided *do* generally come down to ensuring that an appropriate set of fonts is available for display. Only in instances where some backward legacy compatibility is at issue will true issues of "output function of characters" be involved, and even in many of those cases, such behavior in modern systems will simply be emulated in bitmap graphic displays using special fonts.
14. Section 7, ff.
This document makes a number of references to "UNICODE" and "the UNICODE consortium". This is incorrect usage. The term "Unicode" is trademarked and should always be shown title-cased -- never in all caps, as if it were an acronym. Correct usage is as follows:
the California-domiciled non-profit corporation: Unicode, Inc.
the name of the vendor consortium: the Unicode Consortium
the name of the standard: the Unicode Standard
15. Section 7, * Manufacturer standards.
Note once again that many vendor character encodings *are* widely used in open interchange on the Internet -- not merely in internal (proprietary) networks. This is particularly true of but not limited to Windows code pages. Procurers would ignore this fact at their peril.
In general, this entire TR, focused as it is on ISO and CEN standards that bear on European character requirements, is missing the extremely important international context that the Internet has now placed all procurement decisions in. The model for software deployment and application structuring is undergoing a revolution now, and that has everything to do with distributed availability on the Internet and the World Wide Web. Any well-informed procurement decision of IT technology these days should be taking the Internet and the World Wide Web into account -- and in those domains now, Unicode is being used to tie the protocols, standards, and layers together for interoperability.
Counseling European procurers to root around in ISO 10367 while effectively ignoring the impact of Internet protocols and the Web on character set usage and choice verges on the irresponsible.
16. Section 7.2 Manufacturer standard.
The first paragraph makes the claim "All use 8-bit codes," which is flat-out not true if the Unicode Standard is to be considered a manufacturer standard, as claimed in Section 7.
The last paragraph of this section points to K. I. Larsson's publication for mappings between various official and manufacturer standards. The TR should also point to the Unicode website, which contains extensive mapping tables between vendor character sets and Unicode -- including the mapping tables officially maintained by, and sanctioned by, Microsoft and Apple for their vendor character sets.
17. Section 7.3 Related Standards.
The paragraph of Ordering standards should also mention the Unicode Technical Report #10, Unicode Collation Algorithm. That is the ordering standard sanctioned by the Unicode Consortium. It is designed to be conformant to ISO 14651, and will be seeing widespread implementation in the IT industry for string ordering and string matching.
18. Section 8 International character sets.
First paragraph states, "For processing, most IT systems use proprietary manufacturer standards." This is no longer really the case, given that most manufacturers have already moved to Unicode, or are moving to Unicode as fast as they can. Since Unicode is conformant to 10646, this is tantamount to using the international UCS standard for processing. Yes, the manufacturers all continue to provide support for their own legacy encodings, and will continue to do so indefinitely, but this is different from a bald claim that their usage is proprietary.
This misrepresentation is particularly egregious in section 8.1, since this section introduces the ISO 2022/4873/12070/2375 framework for interworking 7-bit and 8-bit character sets, with the implication that the 2022 framework is the answer for interoperable exchange involving all those proprietary manufacturer character sets (!). To the contrary, it is perfectly clear that the entire IT industry is walking away from this framework as fast as it can, and is embracing Unicode/10646 as the solution for character set interoperability. The Guide *should* explain 2022, etc., but it should also be providing stronger guidance that puts this framework in its historical place and points out that new procurements should not be specifying it, except under the most explicit and narrow of circumstances.
19. Section 8.2 7- and 8-bit character set standards.
The discussion of 8859 claims that "There are currently 15 parts to ISO/IEC 8859,..." That is not the case. Parts 11 and 12 do not exist, so there are 13 parts, numbered 1..10, 13..15. This mistake is repeated in the discussion of ISO/IEC 10367.
A more serious problem is that the discussion of ISO/IEC 10367 does not strongly deprecate that standard for procurement. We feel that 10367 should *not* be used for procurement. It combines the usefulness of the simple 8859 standards with the bad (and mostly unimplemented) code extension techniques of 2022. It is not implemented, and procurers should not be misled into thinking they should use it as the basis for a procurement specification for a larger European repertoire, when Unicode/10646 is the clear, widely implemented alternative.
20. Section 8.3.1. The Basic Multilingual Plane.
The guidance paragraph states that "there is evidence that the UCS will be used for the processing function as well." This could be stated much more strongly. The UCS clearly and obviously has been used as a processing code in many systems for many years now. There is no need to pussyfoot around that fact.
The guidance goes on to state that "However, 7- and 8-bit systems will continue to exist for some considerable time, possibly for up to 25 years." Once again, this is failing to make adequate distinctions between *data* and *software*. Microsoft will be aggressively moving Windows systems to Windows 2000, which is "all Unicode, all the time." And it will not be much more than 5 years before all the prior versions of Windows become so "crusty" in the context of ongoing software and hardware development that they become all but unusable, the way old MS-DOS systems currently are. Yes, the existing data, especially the vast amounts in databases, will live on for a very long time, but the procurement issues based around supporting that data are different than the procurement issues based around software interoperability and end-user character repertoire support.
21. Section 8.3.2. states that "A procedure has been agreed for the submission of proposals for the addition of subsets to the base standard." This is true, but misleading. It is misleading because it suggests that formal subsets in 10646 solve the problem of what characters to support in any interesting way. We have no quibble with the need of European procurers to specify minimal sets of characters from a repertoire that they feel would be adequate to support their application needs. But it is unlikely that registration of these as formal subsets in 10646 will help much in meeting those needs. This problem is aggravated by the fact that the actual subsets currently being specified by TC304 (cf. Section 9.2 of this document) contain a number of severe technical flaws that render them incoherent as self-contained subsets. In point of fact, most vendors will end up analyzing the MES repertoires, producing implementations that cover the behavior they deduce the European procurers actually want, based on the intent of those repertoires, and then will claim to meet the spirit of the procurement specification, even though internally they will not in fact match the exact list of any of the MES repertoires. This is a case where a detailed (but inconsistent and flawed) specification of a repertoire is less helpful to procurement than would have been a general specification of the intent of coverage for a repertoire.
22. Section 8.3.3. Unicode.
We agree that The Unicode Standard (note, not "UNICODE standard") does play an important role from the point of view of procurement. The versions of the Unicode Standard are much more precise and verifiable as to exact repertoire and content than is 10646 plus some number of amendments. We think that the Guide should be more explicit here and point directly to the versions page of the Unicode website, which provides exactly the kind of information, in detail, that would help a procurer be precise in specifying an exact level of the UCS to support.
By the way, the publication date for the Unicode Standard, Version 3.0 (not "3") is now 2000, not 1999. For the References, the exact citation is:
The Unicode Consortium, The Unicode Standard, Version 3.0. Reading, MA, Addison Wesley, 2000. ISBN 0-201-61633-5.
23. Section 9.1 8-bit character sets.
The use of EN 1923 should be deprecated here. Support of LL8 via ISO 10367 is a *bad* choice for procurement.
24. Section 9.2 16-bit character sets - the MES's.
Insufficient guidance is provided here for the use or specification of MES-2 and MES-3. None of the pitfalls and drawbacks of MES-2 and MES-3, including the fact that neither of these subsets is closed under normalization, is pointed out to the potential procurer. Since both of these subsets include combining marks and imply the use of 10646 at Level 3 (full support for combining marks), issues such as normalization should not be ignored here. It is quite evident that procurement specifications involving MES-2 and MES-3 are going to lead to more procurement difficulties than they solve.
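The closure problem can be demonstrated in a few lines (Python's unicodedata module is used; the one-character repertoire is of course artificial):

```python
import unicodedata

# A repertoire containing a precomposed character but not the
# characters of its canonical decomposition is not closed under
# normalization.
repertoire = {"\u00c5"}  # A-ring, but neither A nor U+030A COMBINING RING

decomposed = unicodedata.normalize("NFD", "\u00c5")
assert decomposed == "A\u030a"
# NFD has produced characters outside the repertoire:
assert not set(decomposed) <= repertoire
```

Any subset intended for use at Level 3 needs to be checked in exactly this way against the normalization forms its data may pass through.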
25. Section 9.3 The EURO SIGN.
The specification of ISO-IR 204, 205, and 206 should *not* be encouraged *at all* by this TR. Those represent non-standard registered character sets that replace one character in standard character sets with the EURO SIGN. These are not implemented by those who implement the 8859 standards that these registrations are based on, and encouraging their specification in procurements is just another invitation to data corruption.
26. Section 10.1.2. Procurement issues:
Code structure for the interchange function. This section clearly should mention the Internet and its impact on this decision.
27. Section 10.1.3. Procurement issues:
Repertoire for the interchange function. We take issue with the recommendations here. The BL repertoire is not sufficient for French. (As is obvious from the hardball standards politics which went into adding characters for French in 8859-15.) The use of LL8 should clearly be deprecated, because of its close tie to an outmoded mechanism for character encoding. And the use of MES-2 and MES-3 is problematical for interchange because of their defective contents. The only uncontroversial guidance here is "The repertoire selection indicated above is a minimum requirement. The supplier may go further." In fact, most IT implementations that intend European coverage *will* go further, simply because it is easier and more consistent than trying to stick to MES-2 or MES-3.
28. Section 10.2.1. Interfaces.
Input to Processing. This subsection is unclear. The phrase "will use a transformation mechanism to allow the use to identify extra characters for input," doesn't make much sense, even with the example that follows it. Please rewrite this section to indicate clearly what its intent is.
29. Section 10.2.2 Transformation functions.
This section repeats the unfounded basis for the choice of which five scripts should be supported for Europe (including Armenian and Georgian, but omitting Arabic!). There is also an irrelevancy in the 5th paragraph, which points out that some transformations (such as é --> e/) allow the identity of the original character to be deduced, but "can cause problems with tabular formation of information." *Any* transformation of data -- transliterations or other, data-preserving, reversible, or not -- can cause problems with tabular formation of information. So what? How does that impact a procurer's decision in any material way?
30. Section 10.3 is missing. Presumably 10.4 should be renumbered.
31. Section 10.4.1.
This would be a reasonable place to further discourage the use of 2022. This is also where other mechanisms of character set identification that *are* in widespread use (by Internet protocols such as http) should be clarified for the procurer.
32. 10.4.2 UCS interchange code structure.
The specification of format of the code exchanged is insufficiently precise here. Nearly all implementations of 10646 are in fact Unicode implementations. And they will be using one of the following character encoding schemes (CES's -- see UTR #17) for interchange: UTF-8, UTF-16BE, or UTF-16LE. The latter two forms explicitly spell out the order in which the bytes of 16-bit characters are sent, so they encompass the 3rd bullet already.
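The byte-level difference between the three schemes is easy to exhibit (Python shown; any language's codec layer behaves identically):

```python
text = "A\u00e9"  # U+0041, U+00E9

assert text.encode("utf-8") == b"A\xc3\xa9"          # variable-length bytes
assert text.encode("utf-16-be") == b"\x00A\x00\xe9"  # high octet first
assert text.encode("utf-16-le") == b"A\x00\xe9\x00"  # low octet first
```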
Unicode implementations generally do not make any use of the formal subset identification mechanism of 10646. Any identification of subsets would be strictly by a priori agreement.
Unicode implementations, to be conformant to the Unicode Standard, are by definition Level 3 implementations of 10646. Therefore, specification of level for procurement of IT technology that depends on Unicode is basically beside the point.
In the discussion of signatures, correct the following: "A signature is a sequence of octets sent at the start of an interchange..." to "A signature is a particular sequence of octets sent at the start of an interchange..." The point is that the exact bytes involved are defined by the standards, and are not negotiated or determined in any way by the vendors or applications.
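Those fixed octet sequences are simply the encodings of U+FEFF under each scheme:

```python
# The signature is U+FEFF ZERO WIDTH NO-BREAK SPACE; its bytes are
# fully determined by the encoding scheme, never negotiated.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"
assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"
```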
The guidance paragraph states, "The procurer must choose between such a product [making use of an a priori agreement to determine code structure parameters] and a more general purpose UCS application, in which case broader solutions to this problem must be sought." This is rather vague, and in our opinion not good advice to a procurer. The implication is that a "more general purpose UCS application" will be making use of the Escape codes identification mechanism to indicate subset, level, and form support. But what would those Escape code sequences be used with? -- some non-UCS application that doesn't support any of the UCS semantics, except for being able to parse 2022 Escape code sequences? Interoperability between Unicode applications doesn't make use of these mechanisms, and procurers are unlikely to find such "broader solutions" available, since it is quite unclear what problem they are solving and whether anyone would be providing them to the market.
33. Section 11 Procurement Clauses
In general, we view it as a step forward that these suggested wordings for procurement are stated in terms of character repertoire instead of character encodings. That is, indeed, the more relevant concern for a procurer: does the IT technology in question provide sufficient support for the repertoire of characters that the users of that technology will need?
However, there are some specific defects in the recommendations here. As noted above, it is our opinion that LL8 should also be deprecated. ISO IR 204, 205, and 206 should be deprecated. And there are technical issues with the repertoires of CId(MES-2), CId(MES-3A), and CId(MES-3B) that will render them problematical for procurement.
Generally, recommendations for repertoire support should be stated in terms like: "The product shall support *at least* the xxx repertoire(s) yyyy specified in zzzz." This would make it clear that the procurement specifications are for minimal coverage, but do not preclude systems that go considerably beyond the minimal coverage. By taking such an approach, procurement issues could be simplified, since nearly every system declaring Unicode support will incorporate support for all of the European characters and go well beyond them. Procurement issues involving Unicode-based systems would then devolve to a determination that appropriate input methods (keyboards and others, if applicable) were available, and that appropriate fonts and rendering methods were available to meet the input and display/printing requirements of the procurement.
34. Section 11.3 Output character repertoire.
As stated before, the concept of "output character repertoire" is tied to antiquated technology. For most modern IT procurement, it should be supplemented or replaced by detailed guidelines regarding *font* procurement. What is at issue is guaranteeing that appropriate font support is available for output rendering of the repertoire of characters that is otherwise required for processing and interchange support.
35. Section 11.3.4 Fall-back and other output transformation functions.
This section is overly strict and not clear enough. Effectively, what the Guide is trying to do is provide wording for procurers to require, under fallback conditions, either that:
a. a character be represented visibly in some way to indicate that a fallback has occurred, or
b. in addition, the fallback rendering be such that the end user has a way of determining what the original character was.
The first option is typically implemented with the fallback glyph (e.g. a black box or open box, etc., on display). The second option could be implemented by substituting an entity name, or "\u4e00", etc., for an undisplayable glyph.
The problem is with the wording "should be represented in such a way as to indicate that it is not the original character." With a fallback glyph this is reasonably easy to accomplish. But TC304 does not mean merely such mechanisms for fallback, as is indicated by the rather elaborate fallback tables that TC304 is preparing in conjunction with the specification of the MES's. And with the use of such fallbacks (e.g. substitution of "ss" for s-hacek or ß, etc.) it is very difficult, in general, in the output to determine whether the fallback is the original character or not. Nothing prevents arbitrary character fallbacks from having been in the original data. E-mail is a good example of a medium where authors have often been known to type in fallbacks directly, to prevent the mangling of characters in transfer.
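The irreversibility is easy to see in a sketch (the two-entry fallback table is hypothetical, standing in for the TC304 tables):

```python
# Hypothetical fallback table of the kind discussed above.
FALLBACK = {"\u00df": "ss", "\u0161": "s"}  # sharp-s -> "ss", s-hacek -> "s"

def fall_back(text):
    """Replace undisplayable characters by fallback strings."""
    return "".join(FALLBACK.get(ch, ch) for ch in text)

assert fall_back("Stra\u00dfe") == "Strasse"
# "Strasse" is indistinguishable from original data that already
# contained "ss" -- so the fallback cannot be reliably reversed.
```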
This entire section should be thought through more completely, so as to distinguish between automatic fallback mechanisms and the use of arbitrarily complex (reversible or not) fallback mechanisms for the representation of undisplayable characters. The clauses as currently worded are just invitations to quibbling and misunderstandings in procurement.
36. Section 11.5.4 Fall-back and other transformation functions (Interchange)
This section suffers from some of the same problems as for output fallback. However, in interchange, the automatic fallback is generally to the general substitution character (or to a user-specified character, typically, "?").
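This automatic behaviour is what every codec layer already implements, for example:

```python
# Automatic interchange fallback: characters outside the target
# repertoire are replaced by a substitution character ("?").
assert "Gr\u00fc\u00dfe".encode("ascii", errors="replace") == b"Gr??e"
```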
This section makes an unnecessary distinction between processing and interchange, and talks about a kind of impedance mismatch between a processing repertoire and an interchange repertoire.
A much clearer way to characterize this -- and a way that all the vendors will understand -- would be as follows:
In interchange, *require* the lossless sending and receiving of character data. The Unicode Standard requires this for conformance. Whether a receiver *interprets* a character it receives is up to it (and can be characterized in terms of the "processing repertoire" that the receiver supports).
Then specify the requirements on mapping between different character repertoires. If a receiving process supports a smaller repertoire than the sending process, a fallback representation may be required. It is at that point that it makes sense to distinguish between the types of fallback required and whether reversible transforms to obtain the original character representation are needed.
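The two-step view proposed here can be sketched as follows; Python is used for illustration, and the Latin-1 processing repertoire is an arbitrary example:

```python
original = "\u010cesk\u00e1"  # "Česká"

# Step 1: interchange is lossless -- the receiver gets exactly the
# characters that were sent.
received = original.encode("utf-8").decode("utf-8")
assert received == original

# Step 2: only when mapping into a smaller processing repertoire
# (here Latin-1) is a fallback needed; backslash escapes are one
# reversible choice.
received.encode("latin-1", errors="backslashreplace")  # b'\\u010cesk\xe1'
```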
37. Section 11.7.1 Levels and coding form
This section is attempting to follow the tried and true dictum to be conservative in what you send and liberal in what you receive. However, the wording is completely bollixed up with respect to how 10646/Unicode applications actually work.
The distinction between UCS-2 and UCS-4 is *not* a matter of conservative versus liberal support. So the clause:
"For receiving, the product shall support the level-3 operation using at least the UCS-2 form, the UCS-4 form and the UTF-8 transformation format as specified in ISO/IEC 10646-1:1993."
… is incoherent. Unicode implementations *may* support both UTF-16 and UTF-8, but, at least for interchange, they do *not* support UCS-4. And for conformant interchange, it is only required that the sending process and the receiving process support the same form. Furthermore, a Unicode process that attempted to support UCS-2 (but not UTF-16) would *not* be conformant to the Unicode Standard.
The main point here is that UCS-4 is not somehow *more* than UCS-2, in any way that makes sense in the clause cited above.
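The difference between the forms is easy to see with a supplementary-plane character (Python sketch, for illustration only):

```python
ch = "\U00010000"  # first code point outside the BMP

# In UTF-16 this is one character but two 16-bit code units (a
# surrogate pair); a strictly UCS-2 process cannot represent it.
ch.encode("utf-16-be").hex()  # 'd800dc00'

# In UCS-4/UTF-32 it is a single 32-bit unit.
ch.encode("utf-32-be").hex()  # '00010000'
```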
38. Section 11.7.2 Ordering of octets
This section also bollixes up the specification of ordering of octets, with clauses requiring that a sender "shall support at least the normal ordering of octets..." and that a receiver "shall support both the normal ordering and the reverse ordering of octets..." The way it *should* work is that a sender should be required to support *either* UTF-16BE or UTF-16LE or UTF-8. A receiver should be required to support one of these, as appropriate for the context in which it is operating. If it must operate in an open environment, in which it may encounter any of those 3 forms, *then* it may be required to support all three.
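For reference, the three serialized forms in question look like this for the same two-character text (Python used purely for illustration):

```python
s = "A\u00e9"  # "Aé"

s.encode("utf-16-be").hex()  # '004100e9'  (normal / big-endian order)
s.encode("utf-16-le").hex()  # '4100e900'  (reverse / little-endian order)
s.encode("utf-8").hex()      # '41c3a9'    (byte order is not an issue)
```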
39. Section 11.7.3 Signatures
"For receiving, the product shall support the use of all signatures as specified in ISO/IEC 10646-1:1993."
This wording alone, if interpreted legalistically, would preclude any product conformant to the Unicode Standard from meeting a procurement requirement that was attempting to get UCS character repertoire support. The problem is that some signatures are specified for use with UCS-4, which is *not* a conformant form of the Unicode Standard.
Once again, this is a misapplication of the interoperability dictum.
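For what a signature actually buys a receiver, consider this sketch: Python's "utf-16" codec (used here as an example) writes a byte-order signature when encoding and consumes it when decoding.

```python
data = "hi".encode("utf-16")  # signature + text, in platform byte order

# The first two bytes are the byte-order signature (BOM).
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")

# A receiver that honors the signature recovers the text regardless of
# which ordering the sender used.
assert data.decode("utf-16") == "hi"
```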
All of this verbiage would be enormously simplified if the UCS requirements for procurement were simply replaced by:
"The product shall be conformant to the Unicode Standard."
That would gain the procurer an enormous additional increment in the required level of interoperability, while avoiding the incoherent attempt to specify levels, encoding forms, ordering of octets, and use of signatures based on 10646 outside the context of the Unicode Standard.
In fact, procurers should be warned *against* products that claim conformance to 10646 but which explicitly claim *not* to be conforming also to the Unicode Standard. That is also a guarantee that there will be interoperability problems between that product and the vast majority of UCS implementations that *do* attempt to follow all the additional prescripts of the Unicode Standard.
We realize that TC304 is operating under a mandate to provide guidance for ISO and European standards specifically -- so that TC304 may find it difficult to include in this TR a procurement recommendation for conformance to a vendor standard, even one with such obviously universal applicability as the Unicode Standard. But it is quite predictable that the procurement clauses specified in Section 11.7 of the TR will cause problems for procurement, rather than simplifying and guiding the process.
End of comments