L2/00-355
SC22/WG20 N783

From: Markus Kuhn [Markus.Kuhn@cl.cam.ac.uk]
Sent: Thursday, October 12, 2000 6:45 PM
Subject: (SC22WG20.3075) Comments on ISO PDTR 14652

Keld Simonsen wrote on 2000-10-12 17:42 UTC:
> The current status of the draft is that it exists a PDTR draft available
> via http://www.dkuug.dk/jtc1/sc22/WG20/docs/projects

Thanks! Some comments on ISO PDTR 14652 from

  http://anubis.dkuug.dk/jtc1/sc22/WG20/docs/n690.pdf

a) In line 1603, please make clear that the repertoire map is optional.
In practical implementations such as glibc 2.2, no repertoire maps will
be used any more. All characters will be defined exclusively in the
form . Repertoire maps are an archaic and obsolete pre-UCS concept that
should never lead to mandatory elements of the syntax anywhere. Strings
in locales should either be specified in notation for maximum
portability, or in UTF-8 for maximum readability. Repertoire maps have
nothing practically useful to add to these two options.

b) In section 4.3.2.3, the description of the semantics of the keywords
"default_missing" and "translit_ignore" is incomplete, ambiguous and
confusing. I haven't understood what "translit_ignore" is good for.
Please don't explain it to me; instead, rewrite the document such that
there can be no doubt for me how I have to implement this.

c) In section 4.3.2, there is at the moment no description of a proper
step-by-step algorithm for how transliteration has to be performed
according to the data supplied in these keywords (especially
"default_missing" and "translit_ignore"). With the current formulation,
each implementor will come up with something very different. What does
"ignore" mean, for example? Substitution with the empty string? Is there
any difference between ignoring a character and not providing a
transliteration statement for it? (I can suggest one plausible
transliteration algorithm, but I'd first like to read what you had in
mind originally.)

d) Can included transliteration statements redefine previous ones? This
is one of the many questions about the unspecified transliteration
algorithm that the spec currently does not answer.

e) What are "combining" and "combining_level3" good for? These sets seem
to be meaningful only in one single coded character set, namely UCS, and
there they are hardwired into the respective latest edition of the
ISO 10646 standard. There is no cultural dependency at all here, so
"combining" and "combining_level3" clearly have no place in a cultural
convention specification. They are just fixed properties of a single
standard.

f) wcwidth() and wcswidth() depend on cultural conventions and
transliteration, but I haven't seen any provisions for the necessary
tables. These would be much more important than "combining" and
"combining_level3".

g) In section 4.3.2.1, I have great worries about the idea that the
string can be more than one character long. This leads to an endless
series of implementation problems and should definitely be dropped.

For example, the C99 standard requires all wide-character to multi-byte
conversion (that is where in the C library the transliteration would
have to be hooked in) to behave as if done by calls to wcrtomb().
However, wcrtomb() is required to swallow a wide character immediately
and spit out the corresponding multi-byte sequence (ISO C99, section
7.24.6.3.3). There is no room for buffering wide characters until it
becomes clear what the longest string is at the current position in the
wide character stream. The mbstate_t value only keeps state in the
sequence of multi-byte characters, not in the sequence of wide
characters. Otherwise, the semantics of the file positioning functions
would be messed up completely.

Please please remove the option of transliterating strings into strings.
It sounds neat at first, but clearly wasn't carefully thought through
and obviously is not based on implementation experience.

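The constraint described in g) can be sketched in C. This is only an
illustration under stated assumptions, not how any existing C library
does it: translit_table, translit_lookup() and translit_wcrtomb() are
hypothetical names, and the two table entries are made-up examples. The
wrapper has the same shape as wcrtomb(), one wide character in and
bytes out, so it has nowhere to hold back input, yet it can still emit
a multi-character replacement for a single wide character.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical table: ONE wide character maps to a replacement string. */
struct translit_entry {
    wchar_t from;
    const wchar_t *to;
};

static const struct translit_entry translit_table[] = {
    { (wchar_t)0x20AC, L"EUR" },   /* EURO SIGN (example entry) */
    { (wchar_t)0x2026, L"..." },   /* HORIZONTAL ELLIPSIS (example entry) */
};

/* Longest replacement in the table above, counted in wide characters. */
#define TRANSLIT_MAX_LEN 3

static const wchar_t *translit_lookup(wchar_t wc)
{
    size_t n = sizeof translit_table / sizeof translit_table[0];
    for (size_t i = 0; i < n; i++)
        if (translit_table[i].from == wc)
            return translit_table[i].to;
    return NULL;
}

/*
 * Convert exactly one wide character, in the manner of wcrtomb().
 * If wc cannot be encoded in the current locale, fall back to the
 * table and emit the replacement string instead.  No wide characters
 * are buffered between calls.
 *
 * out must have room for TRANSLIT_MAX_LEN * MB_LEN_MAX bytes.
 * Returns the number of bytes written, or (size_t)-1 with errno set
 * to EILSEQ if neither wc nor its replacement can be encoded.
 */
static size_t translit_wcrtomb(char *out, wchar_t wc, mbstate_t *ps)
{
    mbstate_t saved = *ps;   /* state is unspecified after a failed call */
    size_t n = wcrtomb(out, wc, ps);
    if (n != (size_t)-1)
        return n;

    *ps = saved;
    const wchar_t *repl = translit_lookup(wc);
    if (repl == NULL) {
        errno = EILSEQ;
        return (size_t)-1;
    }

    size_t total = 0;
    for (; *repl != L'\0'; repl++) {
        n = wcrtomb(out + total, *repl, ps);
        if (n == (size_t)-1) {
            *ps = saved;     /* replacement itself is not encodable */
            return (size_t)-1;
        }
        total += n;
    }
    return total;
}
```

A caller would zero-initialise an mbstate_t, provide a buffer of
TRANSLIT_MAX_LEN * MB_LEN_MAX bytes, and call translit_wcrtomb() once
per wide character. A string-to-string rule could not be handled this
way, because the function would have to see later wide characters
before deciding what to emit for the current one.
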
Single-character to string transliteration is, however, no problem at
all, because this is very similar to wide-character to
multibyte-character conversion, and therefore C99 already has all the
necessary infrastructure in place.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: