L2/01-135

From: Sandra O'donnell USG [odonnell@zk3.dec.com]
Sent: Thursday, April 05, 2001 6:40 AM
Subject: Draft comments to DTR ballot of ISO/IEC TR 14652
Document: SC22/WG20 N822; SC22 N3227, L2/01-151

OBJECTION #1
Section 4.1.4.1 comment_char (lines 652-653, and affecting the FDCC-set definition)

Current text: "The comment character defaults to the number_sign "#". All examples in this Technical Report uses "%" as the comment character, except where otherwise noted."

Problem and Action: ISO/IEC 9945-2:1992 (POSIX.2) uses the default comment_char, and for consistency with existing practice, this document should as well. Change the sentence "All examples..." to "All examples in this Technical Report use the default comment character." Also, revise the FDCC-set definition.

OBJECTION #2
Section 4.1.4.2 escape_char (lines 666-667, and affecting the FDCC-set definition)

Current text: "The escape character defaults to backslash "\". All examples in this Technical Report uses "/" as the escape character, except where otherwise noted."

Problem and Action: ISO/IEC 9945-2:1992 (POSIX.2) uses the default escape_char, and for consistency with existing practice, this document should as well. Change the sentence "All examples..." to "All examples in this Technical Report use the default escape character." Also, revise the FDCC-set definition.

OBJECTION #3
Section 4.2 LC_IDENTIFICATION (lines 698-777)

Problem: The text defines a list of properties for an FDCC-set, and states that "All keywords are mandatory unless otherwise noted." (lines 701-702) However, at lines 728-729, it states "If information required for any of the mandatory keywords above is not available, then the corresponding string is an empty string." Further, the i18n LC_IDENTIFICATION section defined at lines 748-777 contains empty strings for six "mandatory" keywords. This is confusing.
What the text is trying to say is that certain keywords must be present, as opposed to requiring that values be assigned to certain keywords. But when most people think of "mandatory", they think of it in terms of values, not keywords. Besides, what is the rationale for requiring that certain keywords be present, but NOT requiring that they include a value? If values are not required, they are not mandatory.

Action: Make the following changes.

1. Change the sentence "All keywords are mandatory..." to "Values must be supplied for all keywords, unless otherwise noted."
2. Add the sentence "This keyword is optional." to the description of keywords email, tel, fax, language, and territory.
3. Remove the sentence at lines 728-729 ("If information required for any of the mandatory keywords...").

OBJECTION #4
Section 4.3 LC_CTYPE (lines 787-788 and 817-821 and affecting Section 4.3.2 "i18n" LC_CTYPE category)

Current wording: "The double increment hexadecimal symbolic ellipses ("..(2)..") works like the hexadecimal symbolic ellipses, but generates only every other of the symbolic character names. As an example. ..(2).. is interpreted as the symbolic character names , , , and , in that order."

Problem: This type of symbolic ellipses allows an FDCC-set author to save a little typing for some scripts if letters for those scripts are arranged in a code set in uppercase/lowercase pairs. Using this type of ellipses, the author can indicate a start and end point for a range, and pick up every other entry. The problem is that this is extremely confusing, especially considering that there already are three other types of ellipses. It will be extremely easy for authors to make mistakes, and difficult to implement and maintain all these variations. The work saved by adding this type of ellipses is outweighed by the implementation and maintenance burden, the loss of readability, and the potential for mistakes that it adds.

Action: Remove lines 817-821.
Remove the reference to double increment hexadecimal symbolic ellipses in lines 787-788. Change the entries in Section 4.3.2 to eliminate usage of this type of ellipses.

OBJECTION #5
Section 4.3.1 Character classification keywords (line 834)

Problem and Action: Grammar; change existing text to "...the interpreting system provides them if missing and accepts them silently..."

OBJECTION #6
Section 4.3.1 Character classification keywords (lines 855-857)

Current wording for digit class: "Define the characters to be classified as numeric digits. Digits corresponding to the values 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 can be specified in groups of 10 digits,..."

Problem: The text was not quite accurate in POSIX.2, and it definitely is not accurate here. The first sentence is copied from POSIX.2, but in that standard, *only* the portable digits 0-9 could be specified. This proposal extends the definition, but only allows decimal digits. The restriction should be spelled out.

Action: Change the first sentence to "Define the characters to be classified as decimal digits."

OBJECTION #7
Section 4.3.1 Character classification keywords (line 867)

Problem and Action: Incorrect class name; change "digits" to "digit".

OBJECTION #8
Section 4.3.1 Character classification keywords (lines 869-878)

Current wording for "outdigit" class: "Define the characters to be classified as numeric digits for output from an application, such as to a printer or a display or a output text file. Digits corresponding to the values <0>, <1>, <2>, <3>, <4>, <5>, <6>, <7>, <8>, and <9> can be specified, and in ascending order of the values they represent. The intended use is for all places where digits are used for output, including numeric and monetary formatting, and date and time formatting. Only one set of 10 digits may be specified. If this keyword is not specified, the digits 0 through 9 of the portable character set automatically belong to this class, with application-defined character values..."
Problem: This keyword as defined is insufficient for its stated use. Assume someone wants to define Roman numerals for use in dates. Since only the values 0-9 can be specified, there is no way to list the Roman numerals X, XI, and XII for the 10th-12th months. Or suppose someone wants to write Chinese monetary values. There is a single character for "ten", a single character for "hundred", and so on. To express 10, you use the "ten" character; to express 20, you use the "two" character plus the "ten" character (two 10s). The outdigit keyword does not allow for the Chinese "ten" or "hundred" (and so on) characters, and so does not fulfill the intended use for "all places where digits are used for output, including numeric and monetary..."

Action: Remove this keyword since it does not satisfy the stated need.

OBJECTION #9
Section 4.3.1 Character classification keywords (lines 902-905)

Current wording in description of "xdigit" class: "...If this keyword is not specified, the digits <0> through <9>, the uppercase letters "A" through <F>, and the lowercase letters <a> through <f>, automatically belong to this class, with application-defined character values..."

Problem: As written, this is different from the POSIX.2 requirement that the xdigit class must contain the portable digits 0-9 and the portable letters A-F and a-f. The text says only that these portable characters are included if the keyword is not specified. Under this wording, a person could write an xdigit class that included only Hindi digits and some subset of Greek letters, and it would be legal. This is inconsistent with POSIX.2, and therefore must be changed.

Action: Remove the clause "If this keyword is not specified," from the sentence beginning at line 902. The revised sentence will read "The digits <0> through <9>..." Also note that "A" in the sentence should be <A>.
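The POSIX.2 requirement described in Objection #9 can be pictured as a simple set check. The following is only a sketch: the function name, the sample classes, and the choice of Devanagari digits to stand in for "Hindi digits" are illustrative assumptions, not anything defined by POSIX.2 or by the draft.

```python
import string

# Portable characters that, per the objection, POSIX.2 requires in xdigit:
# the digits 0-9 and the letters A-F and a-f.
REQUIRED_XDIGITS = frozenset(string.hexdigits)


def missing_portable_xdigits(xdigit_class):
    """Return the required portable characters absent from a proposed class.

    Hypothetical helper for illustration only.
    """
    return sorted(REQUIRED_XDIGITS - set(xdigit_class))


# Assumed stand-in for "Hindi digits": DEVANAGARI DIGIT ZERO..NINE.
devanagari_digits = [chr(cp) for cp in range(0x0966, 0x0970)]
greek_subset = list("αβγδεζ")

# A class the draft wording would permit, but POSIX.2 would not.
bad_class = devanagari_digits + greek_subset

# A class that extends the portable set instead of replacing it.
good_class = list(string.hexdigits) + devanagari_digits

print(len(missing_portable_xdigits(bad_class)), "portable characters missing")
print(missing_portable_xdigits(good_class))
```

Under the revised sentence proposed in the Action, only a class like good_class would be acceptable.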
OBJECTION #10
Section 4.3.1 Character classification keywords (lines 929-932)

Current wording in tolower description: "...If this keyword is specified, the uppercase letters through , and their corresponding lowercase letter, are specified. If this keyword is not specified, the mapping is the reverse mapping of the one specified for toupper."

Problem: The description is incorrect for what happens when the keyword is specified. This is what happens if the keyword is NOT specified. However, the sentence (if fixed) still would be unnecessary because the second sentence "If this keyword is not specified, the mapping is the reverse..." implies that to will be included.

Action: Remove the sentence on lines 929-931 ("If this keyword is specified,...")

OBJECTION #11
Section 4.3.1 Character classification keywords (lines 933-946) (and see also Section 4.3.2 "i18n" LC_CTYPE category [class "combining" and class "combining_level3"; lines 1664-1694])

Current wording for "class" class: "Define characters to be classified in the class with the name given in the first operand, which is a string. This string only contains characters of the portable character set that either has the string "LETTER" in its description, or is a digit or or . The following operands are characters. This keyword is optional. The keyword can only be specified once per named class. The following two names are recognized:

combining
    Characters to form composite graphic symbols, such as characters listed in ISO/IEC 10646:1993 annex B.1.

combining_level3
    Characters to form composite graphic symbols, that may also be represented by other characters, such as characters listed in ISO/IEC 10646-1:1993 annex B.2."

And also current wording from the "i18n" FDCC-set definition, lines 1664-1694:

"% The "combining" class reflects ISO/IEC 10646-1 annex B.1
% That is, all combining characters (level 2+3).
class "combining" / ..; ..; ..;/ ..;..;..;/ ..;;;;;..;;/ ..;;;..;..;;/ ..;..;;;..;;/ ...
;..;..;;..;/ ;;
%
% The "combining_level3" class reflects ISO/IEC 10646-1 annex B.2
% That is, combining characters of level 3.
class "combining_level3"; / ..;..;..;..;/ ..;..;..;;/ ;;;;;;;/ ;;;;;;;;;/ ;;;;..;;"

Problem: I've quoted a lot of original text here, because this is a confusing problem. I could not understand from the description what the classes were supposed to be for, so I looked at the i18n FDCC-set example. It turns out the description and definition of the two combining classes are exactly backward. ISO 10646 defines three levels:

Level 1 -- most restrictive; shall not contain any characters listed in Annex B.1
Level 2 -- less restrictive; shall not contain any characters listed in Annex B.2
Level 3 -- least restrictive; can contain any coded character.

The members listed for the classes in the FDCC-set, however, do not match the definitions. What is called combining_level3 is the group of characters that canNOT appear in a Level 1 or 2 implementation. What is called "combining", and described as being "all combining characters (level 2 + 3)", actually is the list of characters that canNOT appear in a Level 1 implementation.

Action: These classes do not exist in other standards and are so ill-defined that it is impossible to say what characters are supposed to be defined in which class. Remove lines 933-946 and lines 1664-1694 from the draft.

OBJECTION #12
Section 4.3.1 Character classification keywords (lines 947-955)

Current wording in width description: "Define the column width of characters, for example for use of the C function wcwidth(). The operands are first a list for characters, possibly using various ellipses, and semicolon separated, then a , and then the width of these characters given as an unsigned positive integer. Such width-lists separated by may be given for the various widths. The default value of width of characters in class "cntrl" and class "combining" is 0, else the default value of width is 1.
A width for a character may be overridden by a WIDTH specification in a charmap. This keyword is optional."

Problem: This description is very confusing. What does it mean that a "...width for a character may be overridden by a WIDTH specification in a charmap"? Does that mean if it's one thing in the charmap and another in the FDCC-set, the charmap wins? Why should width specifications be in two places?

Also, this class is quite different from other LC_CTYPE classes. For other classes, one lists which characters are in that class, or a one-to-one mapping between uppercase and lowercase. This is different; you list a group of characters, and then define what value their width is. Each character in this class can have a different value, as opposed to other classes where it simply is a Boolean function -- if you're listed, you're in. This class is confusingly defined, and seems out of place in the Boolean-oriented LC_CTYPE section.

Action: Remove lines 947-955.

OBJECTION #13
Section 4.3.1 Character classification keywords (lines 956-973)

Problem: The map keyword is poorly described. According to Annex A, it is supposed to provide the functionality associated with the C library function towctrans(), but that's not clear from the text here ("Define the mapping of characters." What?).

Action: Either remove this keyword, or rewrite the description to make it clearer that this is designed to allow mapping of one type of characters to another, related type. For example, you might want to map hiragana to katakana. Or Hindi digits to portable digits. Etc.

OBJECTION #14
Section 4.3.1 Character classification keywords (lines 975-1002)

Problem: The mapping table of character class combinations duplicates information in POSIX.2 without adding any new data about classes included in this document.
Action: Either remove the table completely, since the information already is available in another standard, or update it to include combination information about classes added for this document.

OBJECTION #15
Section 4.3.2 "i18n" LC_CTYPE category

Problem: The membership of classes is inconsistent and confusing. With a few exceptions, it should match the classifications in the Unicode standard, where the classes/properties are comparable. Right now, class memberships are similar, but not identical to, comparable Unicode classes. For example:

* the digit class includes a large group of digits that Unicode also identifies as being decimal, but is missing these groups:

    Myanmar (U1040..U1049)
    Ethiopic (U1369..U1371)
    Khmer (U17E0..U17E9)
    Mongolian (U1810..U1819)
    Fullwidth (UFF10..UFF19)

  Why should these be omitted, when the others are included?

* the space class includes many of those that Unicode identifies as being space, but is missing:

    U00A0 -- No-Break Space
    U2007 -- Figure Space
    U202F -- Narrow No-Break Space

  Note that this class also has several control characters, like and , that Unicode does not consider part of the space class. However there is much existing practice on POSIX-based systems for including those controls, so it is understandable why they are here.

* the punct class includes some, but not all, characters that Unicode identifies as being punctuation. For example:

  + it includes U2030..U2046, which are in the Unicode general punctuation block, but omits

    U2048 -- Question Exclamation Mark
    U2049 -- Exclamation Question Mark
    U204A -- Tironian Sign Et
    U204B -- Reversed Pilcrow Sign

    These also are in the general punctuation block.
  + it includes the currency symbols in the range U20A0..U20AA, but omits these other currency symbols in the same block:

    U20AB -- Dong Sign
    U20AC -- Euro Sign
    U20AD -- Kip Sign
    U20AE -- Tugrik Sign
    U20AF -- Drachma Sign

  + unlike Unicode 3.0, it includes most of the "Letterlike Symbols" from the range U2100..U213A in the punct class. This includes characters like U210B (Script Capital H), U2115 (Double-Struck Capital N), etc., but omits those that happen to have the word "LETTER" in their name; e.g.,

    U210C -- Black-Letter Capital H
    U2111 -- Black-Letter Capital I

    This range also omits U2139 (Information Source), and U213A (Rotated Capital Q), which are also in this Letterlike Symbols block. It's not clear why any in this range are included in punct, but the particular subset of characters listed is even more confusing.

There are many more differences between this i18n FDCC-set and Unicode, but the point is that the differences exist. This document should use the Unicode values where they exist instead of inventing another group of classifications that differ in dozens of small ways.

Action: Revise the membership of all classes to match the lists Unicode provides, where they exist. HOWEVER, in the few cases where the common practice in POSIX systems differs from Unicode (for example, including some control characters in the space class), retain that existing practice for members of the portable character set. Note, too, that 14652 defines some classes for which there are no matching Unicode properties. Obviously, in these cases, the i18n FDCC-set cannot match Unicode.

OBJECTION #15A
Section 4.4 LC_COLLATE

Problem: This is a placeholder objection for the content of Section 4.4 (LC_COLLATE). It has not been reviewed at this time because of the need to correlate its content with that of ISO/IEC 14651 and a lack of time to do so. Readers should NOT assume Section 4.4 is considered correct and complete simply because there are no specific objections at this time.
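As a rough way to spot-check the kind of comparison made in Objection #15, the Unicode general categories of a few of the omitted characters can be queried programmatically. This is a sketch only. It relies on Python's unicodedata module, which reflects whatever Unicode version ships with the interpreter rather than Unicode 3.0, so results may differ from the data the objection was based on. The Ethiopic range is left out for that reason, and the helper name is an assumption made for illustration.

```python
import unicodedata


def general_categories(first, last):
    """Return the set of general categories found in an inclusive code point range."""
    return {unicodedata.category(chr(cp)) for cp in range(first, last + 1)}


# Digit groups the objection lists as missing from the digit class.
omitted_digit_ranges = {
    "Myanmar": (0x1040, 0x1049),
    "Khmer": (0x17E0, 0x17E9),
    "Mongolian": (0x1810, 0x1819),
    "Fullwidth": (0xFF10, 0xFF19),
}
for name, (first, last) in omitted_digit_ranges.items():
    # "Nd" is the general category for decimal digits.
    print(name, sorted(general_categories(first, last)))

# Space characters the objection lists as missing from the space class.
for cp in (0x00A0, 0x2007, 0x202F):
    # "Zs" is the general category for space separators.
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))

# Currency symbols the objection lists as missing; "Sc" is currency symbol.
print("U+20AB..U+20AF", sorted(general_categories(0x20AB, 0x20AF)))
```

A check of this kind only covers classes that have a comparable Unicode property; classes specific to 14652 and the POSIX practice of keeping some controls in the space class would still need separate handling, as the Action notes.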
OBJECTION #16
Section 4.5 LC_MONETARY (entire section)

Problem: This section includes multiple keywords that were defined in POSIX.2, but it changes their definitions in such a way that existing applications would be invalid. This is incorrect. The changes allow the rules for multiple currencies to be specified in existing keywords, but in POSIX.2, only rules for single currencies can be defined.

While the need to handle multiple currencies is real, the method defined here is significantly different from what has been done when other LC_ categories have had to be extended. When expanding LC_TIME to allow for multiple calendars, new keywords were added (era, era_year, etc.), rather than simply tacking new entries on to the end of existing keywords.

Consider the previously existing LC_MONETARY keyword currency_symbol. It is defined in POSIX.2 as "The string that shall be used as the local currency symbol," while here it is defined as "One or more strings separated by semicolons that are used as the local currency symbol." (lines 2293-2294). Assume I'm defining French currency and the euro. I might have something like this:

    currency_symbol "";""

However, the description of this category no longer is correct -- these are not strings "that are used as the local currency symbol". That implies the two strings are synonyms for each other. The reality is that these are strings that represent different currencies used for this locale. They should not be glommed together in one keyword. It would be more accurate to separate these (and all other keywords that in this draft can take multiple values) into something like

    currency_symbol "";
    alt_currency_symbol "";

As defined in this draft, it is not clear how application programs parse or use these values. Existing implementations request *the* currency symbol and use it to format values.
What would happen to a previously conforming application if it requested the (single) currency_symbol value, but an array of strings was returned? Lines 6509-6510 of the rationale state: "Also the same application call can be made to be valid for countries with a single currency and countries with dual currencies." That's only true if the application is expecting one *or more* values. Existing applications expect exactly one value for most of these keywords.

Now, suppose an application is rewritten to allow for multiple currency symbols. Now what? What rules does it use to decide which currency_symbol value it should use to format a monetary quantity? If the section were designed so that the existing definitions had not changed, but alt_* keywords were added when needed, an application could request currency_symbol when formatting national currency values, and alt_currency_symbol when formatting euros (or another alternate currency).

Also, *because* this section allows multiple currencies to be specified, there is an implied tie between keywords. If currency_symbol includes French francs and euros (in that order), frac_digits, p_cs_precedes, etc., must also specify the rules for francs and euros in the SAME order. The valid_from keyword attempts to explain this dependency, but the wording is very confusing and not restricted to that keyword.

Moving to other keywords, there is a new set of int_* keywords. Under POSIX.2, there were only two such keywords -- int_currency_symbol and int_frac_digits. They were for formatting monetary values using the international currency strings (e.g., "USD " rather than "$" for the U.S. dollar; "DEM " rather than "DM" for the German mark; etc.). Under POSIX.2, quantities that used the international currency string and those that used the local currency symbol used the same values for keywords such as p_cs_precedes, p_sep_by_space, etc.
Annex A says these have been added to accommodate "differences between local and international formats." For example?

At the end of this section, the "i18n" FDCC-set does nothing to illuminate the many new keywords and revised definitions of existing keywords. Since attempted support for multiple currencies is the reason for the many changes and additions to this section (as compared to POSIX.2), an example in this section that illustrates how multiple currencies might actually be specified must be provided. There is an example in the rationale section, but the information needs to be available here.

See below for additional comments on specific keywords.

Action: Restore the original definitions of keywords that exist in the LC_MONETARY section of POSIX.2. Add new keywords for defining alternate currencies. Remove the additional int_* keywords, unless a concrete rationale with examples of real differences between local and international formats is provided. Add an example that shows how to specify multiple currencies.

EDITORIAL #17
Section 4.5 LC_MONETARY (lines 2250-2252)

Current wording: "...Keywords that are not provided, string values set to the empty string "", or integer keywords set to -1, are used to indicate that the value is unspecified, and then no default is implied."

Problem: This wording is unclear.

Action: To follow POSIX.2 more closely, revise the sentence as follows: "Keywords that are not provided, string values set to the empty string (""), or integer keywords set to -1 shall be used to indicate that the value is not available. No defaults are implied."

OBJECTION #18
Section 4.5 LC_MONETARY (lines 2258-2268)

Current wording of valid_from keyword: "One or more strings separated by semicolons, representing a Gregorian date in the form "YYYYMMDD" according to ISO 8601, specifying the beginning date (inclusive from the beginning of day local time) of the validity of a currency.
The position of the string in the list corresponds to the position of operands in other keywords in the LC_MONETARY category. The currencies should be ordered in terms of validity dates, and for each validity period with the currency that the amounts are stored in first. If not specified, it is taken to be an implementation-defined beginning of time. This keyword is optional."

Problem: This wording is unclear and confusing. I think *part* of what this is trying to say is:

"One or more strings, separated by semicolons, of Gregorian dates in the form "YYYYMMDD" that specify the date on which a currency became or becomes valid. Dates are inclusive from the beginning of the day local time....If not specified, the value of this keyword is an implementation-defined beginning of time. This keyword is optional."

The earlier overall objection to this section notes that information about dependencies on the order of values is not restricted to this keyword. Thus, the sentence "The position of the string..." should not appear in this description.

There is no reason to mention ISO 8601 here; specifying the YYYYMMDD order is sufficient.

The sentence "The currencies should be ordered in terms of validity dates..." is unclear; I have no idea what it means.

Action: Revise the text as recommended, rewrite the sentence about "The currencies should be ordered...", and add an example to show how this might be defined.

OBJECTION #19
Section 4.5 LC_MONETARY (lines 2269-2274)

Current wording of the valid_to keyword: "One or more strings separated by semicolons, representing a Gregorian date in the form "YYYYMMDD" according to ISO 8601, specifying the end date (inclusive to the end of day local time) of the validity of a currency. If not specified, it is taken to be an implementation-defined end of time. This keyword is optional."

Problem: The current wording is unclear, and the default value is inappropriate since not all systems define an end of time.
Action: Rewrite as follows: "One or more strings, separated by semicolons, of Gregorian dates in the form "YYYYMMDD" that specify the last day on which a currency was or will be valid. Dates are inclusive to the end of the day local time. This keyword is optional."

OBJECTION #20
Section 4.5 LC_MONETARY (lines 2275-2292)

Current wording: "one or more pairs of integers separated by a specifying the fixed conversion rate between the current currency (determined by the parameter number) and the first currency that is valid, determined by a date provided by the application. If the currency is not the first valid currency for the period in question, the first integer is for multiplying the first valid currency, and the second for dividing this result to get the amount in the current currency. The currency to be the current currency is selected by the application from the date applicable and the currency number (first, second, third etc valid currency at that date); and whether domestic or international formatting is used is also determined by the application. Each pair of integers are separated by a . The default value is "1/100". This keyword is optional..."

Problem: The description of the conversion_rate keyword is incomprehensible. However, an example in the rationale section shows this definition for Deutsche marks and euros:

    conversion_rate 1; 195/100

From this, it appears that the first value for conversion rate should be 1, because it is the "primary" currency, and the value for the second currency should be the true conversion value. However, this example does not match the current keyword text. Note, for example, that an entry is supposed to be "one or more *pairs* of integers", but that the first value in the example is a single integer.

It also is not clear that conversion rates should be in locales, since they often change over time.
The fact that euro conversion rates are fixed in relation to certain national currencies is a specific instance of currency rules, but is not applicable around the world.

Action: Remove this keyword. The description is not clear, the example does not match the description, and it addresses a euro-specific feature, as opposed to being generally applicable to multiple currencies and locales. Further, since previous recommendations are to define different currencies in separate keywords, it would not be consistent to continue defining rules for multiple currencies in conversion_rate.

OBJECTION #21
Section 4.5 LC_MONETARY (multiple keyword entries)

Current wording at end of all non-optional keywords: "This keyword is specified, unless the "copy" keyword is used."

Problem: The multiple appearances of this sentence all are unnecessary. The description of the "copy" keyword states that if it "...is specified, no other keyword is specified." Thus, it is redundant to spell out the restriction about the "copy" keyword at the end of the other keywords. It also is inconsistent. Keyword descriptions in other sections of this draft do not include the redundant sentence.

Action: Remove the sentence at lines 2317-2318, 2320-2321, 2323-2324, 2328-2329, 2333-2334, 2339-2340, 2344-2345, 2350-2351, 2367, 2384, 2392-2393, and 2397-2398.

OBJECTION #22
Section 4.7 LC_TIME (lines 2540-2543 and 2547-2551)

Current wording in abday and day keyword descriptions: "... The first string is the [abbreviated|full] name of the day corresponding to the first day of the week (default Sunday), the second the [abbreviated|full] name of the day corresponding to the second day of the week (default Monday), and so on."
Problem: This wording implies that the first day of the week is locale-specific, and that the %a and %A descriptors may produce the locale-equivalent of "Sunday" if Sunday is defined as the first day of the week, *or* the locale-equivalent of "Monday" if Monday is defined as the first day of the week, etc. This differs from the existing POSIX.2 definition and the descriptions in ISO C for the keywords and the meaning of the format descriptors. In the other standards, abday, day, %a, and %A all are defined in terms of a week that begins on Sunday.

Of course, many locales use a week that begins on Monday, and it is understandable that some want to support this within abday, day, and the format descriptors. But this is an incompatible change with existing practice that will break existing implementations. Further, support for Monday-first locales already exists with the %u, %V, and %W format descriptors.

Action: Revise the text at 2540-2543 as follows: "The first string is the abbreviated name of the day corresponding to Sunday, the second the abbreviated name of the day corresponding to Monday, and so on."

Revise the text at 2547-2551 as follows: "The first string is the full name of the day corresponding to Sunday, the second the full name of the day corresponding to Monday, and so on."

OBJECTION #23
Section 4.7 LC_TIME (lines 2552-2567)

Current wording for week keyword: "Is used to define the number of days in a week, and which weekday is the first weekday (the first weekday has the value 1), and which week is to be considered the first in a year. The first operand is an integer specifying the number of days in the week. The second operand is an integer specifying the Gregorian date in the format YYYYMMDD, and it specifies a day that is a first weekday (all other first weekdays may then be calculated by adding or subtracting a whole multiple of the number of days in the week as specified with the first operand).
The third operand is an integer specifying the weekday number to be contained in the first week of the year. The third operand may also be understood as the number of days required in a week for it to be considered the first week of the year. If the keyword is not specified the values are taken as 7, 19971130 (a Sunday), and 7 (Saturday), respectively. ISO 8601 conforming applications should use the values 7, 19971201 (a Monday), and 4 (Thursday), respectively. This keyword is optional."

Problems: There are multiple problems with this description.

1. There is no need to define the number of days in a week, because the seven-day week is common to all major calendars.

2. The description says this keyword defines "...which weekday is the first weekday (the first weekday has the value 1)" which is confusing but probably is supposed to define which day of the week is considered the first (for example, Sunday is the first day of the week in some cultures, while Monday is in others). Assuming this interpretation is correct, the second operand here is ill-defined to meet this requirement. It requires picking a random date that falls on the first day of the week for this FDCC-set. In this example, November 30, 1997 falls on a Sunday, so it is the value used for locales that have a Sunday-first rule. Implementors then are required to calculate ALL other first weekdays (before and after) from the randomly chosen date. This is hogwash.

3. The description further says the keyword defines "...which week is to be considered the first in a year." It is more accurately defined later in the description as "the number of days required in a week for it to be considered the first in a year." The first definition is unclear and should be changed.

Action: Remove the operand for the number of days in a week. Remove the operand that defines a date of a (random) first weekday.
Change the description of the keyword to be defining "the number of days required in a week for it to be considered the first in a year." OBJECTION #24 Section 4.7 LC_TIME (lines 2569 and 2574) Problem: The descriptions of the abmon and mon keywords say they consist of "twelve or thirteen" month names. POSIX.2 and ISO C only support twelve-month calendars, and existing implementations will break if this is changed. Action: Change the descriptions of the keywords to say the operands consist of twelve month names, not "twelve or thirteen." OBJECTION #25 Section 4.7 LC_TIME (timezone section; lines 2663-2757) Problem: It is completely inappropriate to specify timezone information in an FDCC-set. The draft says this is for specifying cultural conventions, but timezones can cross national boundaries and many time zones can exist within a single country. For countries like the U.S., Canada, Russia, Australia, and others that span many time zones, there is no way to determine which time zone to include in an FDCC-set, or, if multiple zones are included, how to figure out which one to use in what area. As the draft notes, the TZ (timezone) environment variable already exists for specifying time zone information. It absolutely does not belong within a locale or FDCC-set. Action: Remove lines 2663-2757. EDITORIAL #26 Section 4.7 LC_TIME (line 2767) Problem: Table 3 is called "Escape sequences for the date field", but all other text calls these values "field descriptors". Action: Change "Escape sequences" to "Field descriptors". OBJECTION #27 Section 4.7 LC_TIME (line 2780) Current wording for the %F descriptor: "The date in the format YYYY-MM-DD (ISO 8601 format)." Problem: Multiple other places in this draft describe "ISO 8601" format as YYYYMMDD. Action: Make all references to ISO 8601 consistent. OBJECTION #28 Section 4.7 LC_TIME (lines 2781-2782) Current wording for the %g and %G descriptors, respectively: "Week-based year within century, as a decimal number (00-99).
Week-based year with century, as a decimal number (for example 1997)." Problem: There is no explanation of how a "week-based year" differs from any other year. The existing %y and %Y descriptors specify the year within a century, and the year with century, so there is no need for these new descriptors. Action: Remove lines 2781-2782. OBJECTION #29 Section 4.7 LC_TIME (line 2787) Current wording for the %m descriptor: "Month, as a decimal number (01-13)." Problem: As described previously, existing implementations support a 12-month calendar. Action: Change the text as follows: "Month, as a decimal number (01-12)." OBJECTION #30 Section 4.7 LC_TIME (lines 2819-2826) Current wording: "NOTE: %g, %G and %V give values according to the ISO 8601 week-based year. In this system, weeks begin on a Monday and week 1 of the year is the week that includes 4th January, which is also the week that includes the first Thursday of the year, and is also the first week that contains at least four days in the year. . . If the 29th, 30th or 31st January is a Monday, it and any following days are part of week 1 of the following year. Thus, for Tuesday 30th December 1997, %G is replaced by 1998 and %V is replaced by 1." Problem: The month name in one example is wrong. The sentence should read "...If the 29th, 30th, or 31st of December is a Monday,..." Also, since an earlier objection recommends removing %g and %G, this text should remove references to the descriptors, too. Action: Revise the text as indicated. OBJECTION #31 Section 4.8 LC_MESSAGES (lines 2931-2932) Current wording: "Note: This uses regular expression syntax with brackets ([]) to for example specify the both <+> and <1> is allowed as an affirmative answer." Problem: Grammatically incorrect sentence that doesn't say what it means to say. Inconsistent use of symbolic names. Also, since the definitions of yesexpr and noexpr say they are "extended regular expression[s]", it is not necessary to repeat that in the note. 
Action: Rewrite the text as follows: "For yesexpr, this specifies that either or is considered an affirmative answer. For noexpr, the supported negative responses are defined as or ." OBJECTION #32 Section 4.9 LC_XLITERATE (lines 2934-3047) Problem: The ability to transliterate characters from one writing system and/or language to another is something users might think of as a "wow, cool" bit of functionality. However, this is an extremely complex problem. The keywords and syntax defined in this section are completely inadequate to handle this problem, so this section should be removed from the document. Consider the example provided and the way it is described to work. Of course, the example is not intended to be a complete functioning transliteration section, but it raises enough questions to point out how inadequate this proposal is. Current wording: [begin] "4.9.3 Example of use of transliteration LC_XLITERATE include "de_DE";"de_repmap" default_missing translit_ignore .. ;;"";"" ; "" END LC_XLITERATE ... The "include" keyword specifies that the FDCC-set "de_DE" is copied and that the repertoiremap "de_repmap" is used to define the symbolic character names in the FDCC-set "de_DE". ... The first transliteration statement defines a number of transliterations for the LATIN LETTER AE, including into LATIN LETTER A WITH DIAERESIS, GREEK LETTER EPSILON, the two Latin letters A and E, and finally the LATIN LETTER E. The second transliteration statement defines transliteration of the LATIN LETTER S into GREEK LETTER SIGMA, and CYRILLIC LETTER ES. The third transliteration statement transliterates the two Latin letters K and O into the Japanese Hiragana character KO." [end] Start with the "include" keyword. The example shows including the de_DE FDCC-set, and according to the keyword description this is "the name of the FDCC-set...to transliterate from." So the plan here is to transliterate from German into something else. 
But what part or parts of a FDCC-set are supposed to be included here? The entire FDCC-set, with all sections as defined in 14652? If so, what purpose does it serve to include LC_CTYPE, LC_COLLATE, etc., sections here? If not, what exactly from the de_DE set is supposed to be included? There is no information in the LC_XLITERATE section to explain this. Under what circumstances might users define in a locale (FDCC-set) that they want to transliterate from German? Suppose this excerpt appears in a Japanese FDCC-set. It's easy to imagine users wanting to transliterate from Japanese to a number of other writing systems and/or languages. But under this design, a finite set of transliterations would have to be hard-coded into each FDCC-set, seriously limiting users' choices. This is an operation that should be like iconv -- users specify, independent of current configurations, what they want to convert from, and what they want to convert to. Hard-coding a set of instructions is unnecessarily restrictive. The include keyword also specifies the "de_repmap". The keyword definition says the repertoiremap is "to be used for the definition of the transliteration statements." What does that mean? That it defines the list of symbolic characters the German FDCC-set includes? If so, now what? If we continue assuming this is a Japanese FDCC-set, what if there is conflict between the symbolic names in the two sets? Now consider the sample transliteration statements themselves. ;;"";"" ; "" The first converts from to any of four different possible characters. (It's curious why some are in quotes, but others are not.) According to precedence rules, if the first possibility exists in the target set, that is how is transliterated; if not, the next one is tried, and so on. Now, it's hard to imagine a circumstance under which an would be present but not the first-possibility , but assume it isn't. The second choice listed here is a Greek . 
Thus, according to this, I'm transliterating Latin characters into other Latin characters, but if they're not available, I'll choose Greek next. Under what circumstances might someone choose such a transliteration? Of course, the fourth possibility listed is completely superfluous, because all FDCC-sets are required to support the portable characters. Both and from the third choice are in the portable set, so one would never get to the fourth choice. This second example line converts from a Latin to either a Greek or a Cyrillic . This seems to assume that you know what you're converting from, but have no idea what you want to convert to, and so are allowing any potential match. How, then, does a user prevent getting a result that mixes Latin, Greek, Cyrillic, and who-knows-what-else? Once again continuing the Japanese example, many Japanese encodings include Greek and Cyrillic characters. The third example shows Latin and being converted to Hiragana . What if the source language/writing system can pronounce a string in multiple ways? For example, consider English "through", "bough", "rough". This syntax seems to assume a one-to-one mapping between substrings and a target phonetic character. Action: The keywords and syntax in this section are inadequate to handle transliteration. Remove lines 2934-3047. EDITORIAL #33 Section 4.10 LC_NAME (lines 3072-3077, 3094) Problem: Inconsistent wording; what are called "field descriptors" elsewhere in this document are called "escape sequences" here. Action: Change all instances of "escape sequences" to "field descriptors." OBJECTION #34 Section 4.10 LC_NAME (line 3080-3081) Current wording for %g and %G: "First given name. First given initial." Problem: The descriptions are European and North American-centric in assuming a position of a given name. Perhaps the description should be "Primary given name?" Also, what qualifies as an "initial"? Any single character? Any single-byte character? Any single Latin character? 
Some explanation must be provided. Action: Remove assumptions about name position. Add information somewhere in the section about what qualifies as an "initial." OBJECTION #35 Section 4.10 LC_NAME (line 3082) Current wording for %l: "First given name with Latin letters." Problem: What is the rationale for having a descriptor of *first given names*, and only first given names, to be transcribed into Latin letters?? Action: Remove this descriptor and line 3082. OBJECTION #36 Section 4.10 LC_NAME (line 3084-3085) Current wording for the %m and %M descriptors: "Middle names. Middle initials." Problem: The descriptions are European and North American-centric in assuming that additional given names are "middle" names. Also, while other field descriptors here take a single value, these are described such that they could contain multiple names/initials. Thus, it appears multiples would be treated as a unit. For example, if someone has three given names -- Mary Laura Grace -- it appears the value of %m would be "Laura Grace" rather than "Laura" and "Grace". But most people treat each name as a separate entity. It makes more sense to have a single name in each format descriptor, and to use multiple %m's, if needed. See previous objection about the definition of "initial". Action: Change the descriptions to "Additional given name" and "Initial for additional given name." OBJECTION #37 Section 4.10 LC_NAME (line 3086) Problem: The format descriptor %p is described as "Profession." What does this mean and why does it appear in something that is described as being for "addressing a person; e.g., in a postal address or in a letter"? What kinds of values are expected here? Software engineer? Human Resources representative? Journalist? Garbage collector? Truck driver? Training coordinator? All or some of these? How might these be used within a postal address? Within a letter? Action: If there is a legitimate need for this field, add that information. 
Otherwise, remove the descriptor. OBJECTION #38 Section 4.10 LC_NAME (line 3100-3103) Current wording: % This is the ISO/IEC TR 14652 "i18n" definition for % the LC_NAME category. name_fmt "/ " Problem: Since few people have ASCII values memorized, add a comment that explains this name_fmt specifies %p%t%g%t%m%t%f, which is Profession, First (Primary?) Name, Middle (Additional?) Name, Family Name. However, remove %p (Profession)... Action: Make the recommended changes. OBJECTION #39 Section 4.11 LC_ADDRESS (lines 3108-3110 and 3125-3137) Current wording: "The LC_ADDRESS category defines formats to be used in specifying a location like a person's living or office, for use in a postal address or in a letter, and other items related to geography...." Problem: First, there is the wording problem of the phrase "...specifying a location like a person's living or office,..." What is a person's living? This probably should be "...specifying a location like a person's home or office,..." Second, given this description of the LC_ADDRESS section, why are there four keywords for identifying natural language? While there is justification for a locale or cultural file to include natural language information, it is out-of-place in the LC_ADDRESS section. The natural language information does not "define formats for use in specifying a location...or other items related to geography." Action: Reword the sentence at lines 3108-3110. Remove lines 3125-3137. OBJECTION #40 Section 4.11 LC_ADDRESS (entire section) Problem: This section uses the term "escape sequences" for what it called "field descriptors" elsewhere in the draft. "Field descriptor" is the term POSIX.2 uses, and this draft should consistently use it as well. Action: Change all occurrences of "escape sequence" in this section to "field descriptor" to be consistent with most earlier sections. 
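As an illustration of the comment requested in Objection #38, here is a minimal sketch of how a name_fmt value of %p%t%g%t%m%t%f might expand. The function name expand_name_fmt and the sample person are hypothetical. The %t rule (emit a space only when the preceding descriptor produced a non-empty value) is an assumption borrowed from the LC_ADDRESS description of %t; the draft does not spell out the LC_NAME behavior in the text quoted here.

```python
# Sketch only: a hypothetical expander for an LC_NAME-style name_fmt string.
# Descriptor letters follow Objection #38 (p, g, m, f). The %t rule is an
# assumption taken from the LC_ADDRESS description of %t.

def expand_name_fmt(fmt, fields):
    """Expand %X descriptors in fmt using the fields mapping (letter -> value)."""
    out = []
    prev_value = ""  # value produced by the most recent non-%t descriptor
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "t":
                # Assumed rule: a space only if the preceding descriptor was non-empty.
                out.append(" " if prev_value else "")
            else:
                prev_value = fields.get(code, "")
                out.append(prev_value)
            i += 2
        else:
            out.append(fmt[i])  # literal character, copied unchanged
            i += 1
    return "".join(out)


I18N_NAME_FMT = "%p%t%g%t%m%t%f"
person = {"g": "Mary", "m": "Laura", "f": "Smith"}  # no profession supplied
print(expand_name_fmt(I18N_NAME_FMT, person))
```

Under these assumptions, a missing profession or additional given name simply drops out of the result along with its separator, which is the kind of behavior a comment in the "i18n" definition could state explicitly.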
OBJECTION #41 Section 4.11 LC_ADDRESS (lines 3115-3120) Current wording for postal_fmt keyword: "Define the appropriate representation of a postal address such as street and city. The proper formatting of a person's name and title is done with the "name_fmt" keyword of the LC_NAME category. The operand consists of a string, and can contain any combination of characters and field descriptors. In addition, the string can contain escape sequences defined below." Problem: Most postal addresses include the name of the addressee, but from this description and from the listed field descriptors, name formatting is not described here. That seems to mean users should specify name_fmt information in LC_NAME and address-information-without-names in LC_ADDRESS. The two cannot be mixed because each uses the same descriptors to mean different things -- e.g., %f means family name in LC_NAME, but firm name in LC_ADDRESS; %S means salutation in LC_NAME, but state, province, or prefecture in LC_ADDRESS. How are the two values from separate sections tied together without causing collisions? Action: Explain in this section how to add an addressee's name to an address. OBJECTION #42 Section 4.11 LC_ADDRESS (lines 3123-3124) Current wording of country_post keyword: "The operand is a string with the abbreviation of the country, used for postal addresses, for example by CEPT-MAILCODE." Problem: What is CEPT-MAILCODE? Is it the only abbreviation allowed, or are others allowed? If others are allowed, how does a user identify the abbreviation in use? Action: Either explain what CEPT-MAILCODE is, or remove the reference to it. If it is retained, explain either how to identify the abbreviation system in use, or that there is no way to identify which abbreviation system is being used. OBJECTION #43 Section 4.11 LC_ADDRESS (lines 3145-3163) Current wording for selected field descriptors: "%a C/O address. ... %h House number or designation.
%N If any graphical characters have been specified then an end of line is made. %t If the preceding escape sequence resulted in an empty string, then the empty string, else a . %r Room number, door designation. %e Floor number. %C Country designation, from the keyword. %l Local township ... %c Country." Problems: First, it is not clear which descriptors, if any, are restricted to holding numbers only. Usually, a description with the word "number" in it would be assumed to be numeric only, but addresses that have a floor number in them tend to be something like "2nd floor" rather than a simple number, and a house number may include other characters along with numbers. If any of these are restricted to numeric values, that should be spelled out. Second, some descriptions are inadequate. Specifically: %a -- what is a C/O address? In English, this is "in care of," and it identifies a person, not an address. And earlier objections note that people's names can't be included in LC_ADDRESS because of the overlap between LC_NAME and LC_ADDRESS field descriptors. So what is intended for this field? %N -- it would be clearer to say "Insert an end-of-line if the previous descriptor's value was not an empty string; otherwise ignore." %t -- what does this mean? Suppose the preceding descriptor was %f, and there was no value for it. This says do nothing. What purpose does that serve? %r -- can this include all characters or just numeric? %l -- How does this differ from %T? %c -- Is this value taken from the country_name keyword? If so, that should be listed here. Action: Make the recommended changes or add more information to explain the intention of a given field descriptor. OBJECTION #44 Section 4.11 LC_ADDRESS (lines 3174-3184) Current wording: " LC_ADDRESS % This is the ISO/IEC TR 14652 "i18n" definition for % the LC_ADDRESS category. 
% postal_fmt "/ / / / " END LC_ADDRESS" Problem: Once again, most of us don't have ASCII memorized, so there should be a comment that explains what has been defined for this keyword. Currently, it is: "%a%N%f%N%d%N%b%N%s%h%e%r%N\ %C%z%T%N%c%N" Even this is very cryptic, so here is more information with all "%N" values converted to , and all and characters indicated: c/o firm name department name building name street or block name house number floor number room number country_post value zip/postal code town/city country Here's an example of a fictional address using this format: c/o General Electric Consumer Products Division Building 52 Lightbulb Road 110 2 57 USA-44555 Chicago United States of America Given the confusion about the %a (c/o address) descriptor, the sample value here is simply a place-holder. This also assumes that house, floor, and room number values must be numeric only, though that may be incorrect. While it certainly is true that addresses are culture-specific, and no one format will satisfy all, the "i18n" value here matches the postal_fmt value in the sample Danish FDCC-set later in this draft. It appears, then, that this format matches Danish conventions. It's not clear the listed order is appropriate for an international standard. For example, using the field names defined here, in the U.S. the order generally is: //not defined in LC_ADDRESS //uncommon //uncommon The fact that the existing postal_fmt lists country in two different ways, does not include a value for state/province, and puts the town/city after country and zip/postal code makes this unsuitable for U.S. addresses. Of course, the goal is not to define U.S. addresses, but it's not clear whether the value listed is appropriate for a significant number of users from other countries. Action: Research whether the listed postal_fmt value is appropriate for a significant percentage of the world community. If not, revise the value.
Regardless of whether postal_fmt changes, add a comment explaining what the value is (all the descriptors plus an explanation of them). OBJECTION #45 Section 4.12 LC_TELEPHONE (entire section) Problem: This section uses the term "escape sequences" for what it called "field descriptors" elsewhere in the draft. "Field descriptor" is the term POSIX.2 uses, and this draft should consistently use it as well. Action: Change all occurrences of "escape sequence" in this section to "field descriptor" to be consistent with most earlier sections. OBJECTION #46 Section 4.12 LC_TELEPHONE (lines 3212-3216) Current list of field descriptors: "%a area code without prefix (prefix is often <0>). %A area code including prefix (prefix is often <0>). %l local number. %c country code %C alternative carrier service code used for dialing abroad" Problem: These field descriptors are ambiguously described, and it's not clear they are adequate for specifying telephone numbers. Specific problems include: * When the field descriptors contain numeric values, are those values restricted to the portable digits, or can they contain other decimal digits? Either way, this information needs to be included. * What is the "prefix" that %a and %A mention? There is no description of it. * Is %l restricted to numeric content only, or can it contain characters users commonly use to make local phone numbers more readable? For example, if the local number is 4561234, could %l contain only "4561234", or could it contain "456-1234"? If it could contain the latter, how does one define where format characters should be included? (Formatting conventions are culture-specific.) If it can only contain numbers, this is inadequate, because local phone numbers almost always include some non-numeric characters to improve readability. * There needs to be more information about what the "alternative carrier service code" is. It's not clear whether it is useful, since there's nothing to explain what it is. 
* What about extensions? Some local phone numbers have extensions to them (e.g., 434-1212 x97), but no extension field is provided here. Action: Add information about prefix and alternative carrier service codes. Add a field descriptor for extensions. Add information about numeric restrictions, or lack thereof. Add information about formatting local numbers. OBJECTION #47 Section 4.12 LC_TELEPHONE (lines 3221-3227) Current wording: " LC_TELEPHONE % This is the ISO/IEC TR 14652 "i18n" definition for % the LC_TELEPHONE category. % tel_int_fmt "/ " END LC_TELEPHONE" Problem: As before, most people have not memorized ASCII, so there needs to be a comment that explains what this represents. A comment might, in fact, have helped bring to light that this format contains two errors. It currently is defined as: +%c+a+l Thus, two field descriptors are missing the required leading "%" signs. To match what the author presumably intended, the actual definition should be: +%c+%a+%l and lines 3225-3226 would be: tel_int_fmt "/ " However, consider the output of a telephone number using this format. It could be: +1 +212 +5551212 //assumes %l cannot contain formatting characters +44 +91 +12-34-56 //assumes %l can contain formatting characters Many telephone numbers in "international format" use the plus sign to designate the country code, but we are not aware of any that use the plus sign before the area code and local number. Action: Remove the plus sign before %a and %l in the format. Add the percent sign characters before the "a" and "l" format descriptors. Add a comment explaining what tel_int_fmt designates. OBJECTION #48 Section 5. CHARMAP (lines 3232-3233) Current wording: "A character set description may exist for each coded character set supported by an application. This text is referred elsewhere in this Technical Report as a charmap." Problem: This does not make sense. Applications should not support specific coded character sets; implementations like OSes and desktops usually provide such support.
Also "This text is referred elsewhere..." is incoherent. Action: Reword the paragraph as follows: "A character set description file may exist for each coded character set supported by the implementation. This file is referred to elsewhere in this Technical Report as a charmap." OBJECTION #49 Section 5. CHARMAP (lines 3267-3276 and other affected lines throughout) Current wording for <escape_char> and <comment_char>, respectively: "The escape character used to indicate that the characters following is interpreted in a special way, as defined later in this subclause. This defaults to backslash (\). The character slash (/) is used in all the following text and examples, unless otherwise noted. The character that when placed in column 1 of a charmap line, is used to indicate that the line is ignored. The default character is the number sign (#). The character percent-sign (%) is used in all the following text and examples, unless otherwise noted." Problem: This document should use the default <escape_char> and <comment_char>, rather than the characters chosen here. Using the defaults aligns this document with POSIX.2. Action: Reword the sentences as follows: //for <escape_char> "This defaults to backslash (\), which is the character used in all following text and examples, unless otherwise noted." //for <comment_char> "This defaults to the number sign (#), which is the character used in all following text and examples, unless otherwise noted." Also, change the examples throughout this document to match the usage described here. OBJECTION #50 Section 5. CHARMAP (lines 3283-3310) Problem: The , , and keywords are designed to allow charmap writers to specify ISO 2022 escape sequences. With more of the world's internationalization implementations moving to ISO 10646 and Unicode, it is not necessary to add increasingly-obsolete ISO 2022 syntax to the charmap. Note also that the existing description of refers to the keyword, not . Action: Delete these keywords, and the example at lines 3406-3478.
Also note that the example variously calls a particular code set and , and that the example also uses as a keyword, even though the actual keyword is . Of course, all this should be removed. EDITORIAL #51 Section 5. CHARMAP (line 3374) Current wording: "...(hexadecimal constants is recommended)." Problem: Grammar. Action: Rewrite as: "...(hexadecimal constants are recommended)." OBJECTION #52 Section 5. Width subsection (lines 3481-3511) Problem: Both the FDCC-set description and the charmap have keywords for defining width. It is incorrect to have them in both places; it will only lead to confusion. Also note, with a mixture of amusement and fatigue, that the width keyword is currently defined as taking an "unassigned positive integer" (line 3487). Action: Keep width information in only one place. The charmap seems a bit more appropriate than the FDCC-set, but if width is retained in the FDCC-set (over Objection #11), it must be removed here. OBJECTION #53 Section 6. REPERTOIREMAP (entire section) Problem: There are multiple problems with this section. They include: * The naming conventions chosen for symbolic characters. The cited justification is the "many POSIX charmaps registered with ISO/IEC 15897" and "use on the Internet". However, ISO/IEC 15897 only defines the information that can be contained in its registry of cultural elements, not the naming conventions to be used. The author of this draft has submitted multiple charmaps to be registered under ISO/IEC 15897 and has used the naming conventions he cites here. In essence, he is endorsing himself when he points to those charmaps and their naming conventions. Note, too, that the POSIX charmaps have been offered as international standards since the early 1990s, but they have only been used when organizations take free software (e.g., Linux). There is little evidence of them actually being used, and ample evidence that industry leaders who ship charmaps and locales are NOT using these naming conventions.
The naming conventions are unnecessarily obscure and should not be the ones used for a repertoiremap. * The repertoire of the repertoiremap is curiously incomplete. It cites ISO/IEC 10646, but contains only a subset of characters in that standard. If the repertoiremap exists, it should contain the entire repertoire of characters in ISO/IEC 10646. It's difficult to determine exactly what characters are and are not in the repertoiremap because they are reordered relative to their ISO/IEC 10646 code points. The repertoire may most closely match Unicode R2.0, but *without* any of the thousands of CJK Unified Ideographs. * At lines 3642-3667, the repertoiremap includes "weight" characters (e.g., , , , etc.) that are supposed to indicate the position of the last a, the last b, the last c, and so on. While potentially handy for those who use the Latin script, it's questionable why Latin-specific weights should be in a repertoiremap. Further, these weights are equated to ISO/IEC 10646 codepoints: Weight indicating the position of the last a Weight indicating the position of the last b Weight indicating the position of the last c Weight indicating the position of the last d Weight indicating the position of the last e Weight indicating the position of the last f ... However, these code points already are assigned to other characters. is LATIN SMALL LETTER TURNED ALPHA; is LATIN CAPITAL LETTER B WITH TOPBAR; is LATIN SMALL LETTER C WITH CURL; and so on. This is an obvious conflict with ISO/IEC 10646. * As noted, the order of the repertoiremap does not match ISO/IEC 10646. It should. There is nothing to be gained by changing the order, and a lot of easy look-up ability to be lost. Action: Since the ISO/IEC 10646 identifier names are being used elsewhere in the document, it is not clear that a repertoiremap is needed at all. If it continues to exist, the justification for symbolic names is faulty and should be removed. More mnemonic symbolic names should be substituted. 
The repertoiremap should include the full ISO/IEC 10646 repertoire. The weights must be removed to avoid conflict with ISO/IEC 10646. Entries must be in the same order as they appear in ISO/IEC 10646. OBJECTION #54 Annexes Problem: This is a placeholder objection for the content of the annexes. They have not been reviewed at this time because there are so many objections to the main sections. Assuming those objections are processed appropriately, the annexes will have to change in multiple ways to accommodate the many changes. Readers should NOT assume the annexes are considered correct and complete simply because there are no specific objections at this time to them. End of document