L2/11-131

Title: Proposal to Update Syntax for Unicode/UCS Sequence Identifiers (USI) in ISO/IEC 10646
Author: Ken Whistler
Date: April 27, 2011
Action: For consideration by the UTC and L2

Background

The U.S. NB requested in its comments on the ballot for CD 10646 3rd Edition that the syntax for the data file containing the list of Named UCS Sequence Identifiers (NUSI.txt) maintain the same data format as the UCD data file, NamedSequences.txt. (See T5 in L2/10-385, = WG2 N3936.) That comment also requested a corresponding change for the documentation of the syntax of the second field of NUSI.txt in Clause 25, so that the syntax used by NUSI.txt would be consistent with that used in NamedSequences.txt.

As it turned out, that request was controversial in WG2, and the U.S. NB withdrew the comment at the last WG2 meeting, rather than make the disposition of comments for the ballot more problematical than it already was. The main reason for the controversy, as I understand it, was the mismatch that would ensue between the syntax for the Named UCS Sequence Identifiers and its data file, and the syntax specified in the standard for UCS Sequence Identifiers in Clause 6.6.

To address this issue, and to be able to move forward on the real goal here, which is to avoid having to maintain two different normative data files in parallel, but with different syntaxes, I am now suggesting a modified request. This would address the specification of the UCS Sequence Identifiers in Clause 6.6, so as to eliminate the problem of a syntax mismatch.

Accordingly, I would like the UTC to discuss and approve the following suggested changes, and for L2 to approve this as a U.S. NB position to forward to WG2 for their consideration.

============================ proposal =========================

[Insert appropriate document header boilerplate here.]

Introduction

Clause 6.6 of ISO/IEC 10646 defines the UCS Sequence Identifier (USI).
The text of the clause in the FCD for the 3rd Edition currently reads as follows:

    ISO/IEC 10646 defines an identifier for any sequence of code points taken from the standard. Such an identifier is known as a UCS Sequence Identifier (USI). For a sequence of n code points it has the following form:

        <UID1, UID2, ..., UIDn>

    where UID1, UID2, etc. represent the short identifiers of the corresponding code points, in the same order as those code points appear in the sequence. If each of the code points in such a sequence has a character allocated to it, the USI can be used to identify the sequence of characters allocated at those code points.

    The syntax for UID1, UID2, etc. is specified in 6.5. A COMMA character (optionally followed by a SPACE character) separates the UIDs. The UCS Sequence Identifier includes at least two UIDs; it begins with a LESS-THAN SIGN and is terminated by a GREATER-THAN SIGN. The full syntax of the notation of a UCS Sequence Identifier, in Backus-Naur form, is

        "<" (xxxx | xxxxx | xxxxxx) (("," space?) (xxxx | xxxxx | xxxxxx))+ ">"

    where "x" represents one hexadecimal digit (0 to 9, A to F, or a to f).

The notation specified in that clause follows widespread practice for citation of UCS character sequences in descriptive text. In such contexts, the use of angle brackets is not problematical, and in fact helps in visual identification of the sequences. The mix of commas and spaces also helps visually. However, in data files, this notation is unnecessarily complicated to parse, and in actual practice, different, simpler notations are widely used in data files for the representation of UCS Sequences.

We propose to modify the text of Clause 6.6 to accomplish the following goals:

1. Make the specification of the syntax for UCS Sequence Identifiers (USI) clearer.

2. While retaining the validity of the existing definition, extend the allowed representation of the USI, so that formats widely implemented in data files will be recognized as valid USIs.

3. Make it simpler to maintain associated data files for specifying normative data such as the list of Named UCS Sequence Identifiers, without having to construct duplicate, parallel data files containing the same substantive content, but using distinct formats.

The revision for Clause 6.6 should use a more extended Backus-Naur form for the specification of the UCS Sequence Identifier (USI), so that it will be clear what is intended. As in the existing Clause 6.6, this specification makes use of the definition of UCS Short Identifiers (UID) from Clause 6.5. We suggest the following text:

    ISO/IEC 10646 defines an identifier for any sequence of code points taken from the standard. Such an identifier is known as a UCS Sequence Identifier (USI). The format of a USI depends on the definition of a UCS Short Identifier (UID), specified in Clause 6.5. The full format for a USI is specified by the following, in Backus-Naur form:

        SPACE ::= U+0020
        COMMA ::= U+002C
        LEFTBRACKET ::= U+003C
        RIGHTBRACKET ::= U+003E
        Space_Delimited_Sequence ::= UID (SPACE+ UID)+
        Comma_Delimited_Sequence ::= UID (COMMA SPACE? UID)+
        Unbracketed_Sequence ::= Space_Delimited_Sequence | Comma_Delimited_Sequence
        Bracketed_Sequence ::= LEFTBRACKET Unbracketed_Sequence RIGHTBRACKET
        UCS_Sequence_Identifier ::= Unbracketed_Sequence | Bracketed_Sequence

    In a UCS Sequence Identifier, the UID values occur in the same order as those code points appear in the sequence to be represented. If each of the code points in such a sequence has a character allocated to it, the USI can be used to identify the sequence of characters allocated at those code points. A UCS Sequence Identifier includes at least two UIDs.

    Example 1. For typical use in descriptive text, or in printed tables meant to be read, a USI may be represented using a format which is more difficult to parse, but which facilitates reading. For example, using a Bracketed_Sequence which contains a Comma_Delimited_Sequence, and which contains UIDs using the "U+" prefix:

        <U+0069, U+0307, U+0301>

    Example 2.
    For typical use in data files, a USI may be represented using a format which is easier for automatic parsing. For example, using an Unbracketed_Sequence which contains a Space_Delimited_Sequence, and which contains UIDs without the "U+" or other prefixes:

        0069 0307 0301

If this change is adopted for the specification of the USI, then the text of Clause 25 pertaining to the data file which defines Named UCS Sequence Identifiers (NUSI) can also be simplified and modified so that there will be no need to maintain multiple versions of such a data file with radically different syntax conventions. Currently, the relevant text reads:

    The content linked to is a plain text file, using ISO/IEC 646-IRV characters with LINE FEED as end of line mark, that specifies after a 5-lines header, Named UCS Sequence Identifiers; each line containing the following information organized in fields delimited by a TAB character:

    * 1st field: UCS sequence, following syntax defined in 6.6
    * 2nd field: Name of the NUSI (following rules given in 23.5)

We suggest that this be modified to the following text:

    The content linked to is a plain text file, using ISO/IEC 646-IRV characters with LINE FEED as end of line mark, that specifies Named UCS Sequence Identifiers. Each line in the text file contains the following information organized in two fields:

    * 1st field: Name of the NUSI (following the rules given in Clause 23.5)
    * 2nd field: The USI associated with that Name (following the syntax defined in Clause 6.6)

    The two fields are delimited by a SEMICOLON (';') followed optionally by zero or more SPACE characters. Comment lines, starting with a NUMBER SIGN ('#'), are informational only. Comment lines and blank lines in the text file should be ignored by any automatic process which parses the data file to extract the normative list of NUSIs.
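To illustrate how little parsing the suggested Clause 25 format would require, the following is a minimal sketch in Python. It combines the Backus-Naur grammar proposed for Clause 6.6 with the two-field line format described above. The function names (parse_usi, parse_nusi) are hypothetical, and the UID pattern (an optional "U", an optional "+", then four to six hexadecimal digits) is an assumption about Clause 6.5, which is not reproduced in this document. The sample lines are illustrative only.

```python
import re

# Assumed UID shape (Clause 6.5 is not quoted in this proposal):
# optional "U", optional "+", then four to six hexadecimal digits.
UID = r"(?:U?\+?[0-9A-Fa-f]{4,6})"

# Direct transcription of the proposed grammar for Clause 6.6.
SPACE_DELIMITED = rf"{UID}(?: +{UID})+"    # UID (SPACE+ UID)+
COMMA_DELIMITED = rf"{UID}(?:, ?{UID})+"   # UID (COMMA SPACE? UID)+
UNBRACKETED = rf"(?:{SPACE_DELIMITED}|{COMMA_DELIMITED})"
USI = re.compile(rf"(?:{UNBRACKETED}|<{UNBRACKETED}>)")


def parse_usi(text):
    """Return the code points of a USI, or raise ValueError if it is malformed."""
    if not USI.fullmatch(text):
        raise ValueError(f"not a valid USI: {text!r}")
    return [int(digits, 16) for digits in re.findall(r"[0-9A-Fa-f]{4,6}", text)]


def parse_nusi(lines):
    """Extract (name, code points) pairs from lines in the suggested format."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        # Comment lines and blank lines carry no normative content.
        if not line.strip() or line.startswith("#"):
            continue
        name, separator, usi = line.partition(";")
        if not separator:
            raise ValueError(f"missing SEMICOLON delimiter: {line!r}")
        # The SEMICOLON may be followed by zero or more SPACE characters.
        entries.append((name, parse_usi(usi.lstrip(" "))))
    return entries


if __name__ == "__main__":
    sample = [
        "# Illustrative sample, not the actual NUSI.txt\n",
        "\n",
        "LATIN SMALL LETTER I WITH DOT ABOVE AND ACUTE; 0069 0307 0301\n",
    ]
    for name, code_points in parse_nusi(sample):
        print(name, [f"{cp:04X}" for cp in code_points])
```

Because the same parse_usi function accepts both the bracketed, comma-delimited form and the unbracketed, space-delimited form, a single reader of this kind could handle sequences cited in either style.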
The data file, NUSI.txt, should then be updated in three ways: to use the field order specified above, to use a SEMICOLON instead of a TAB character as the field delimiter, and to mark the header lines explicitly with the comment line introduction character (NUMBER SIGN). These changes would simplify parsing of the data and bring it into line with the parsing already in widespread use for similar data files related to ISO/IEC 10646 content.
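The update described above is mechanical, as the following Python sketch suggests. It assumes the existing file follows the current Clause 25 text exactly (a 5-line header, then TAB-delimited lines with the USI first and the name second). The function name convert_nusi and the sample input are hypothetical, and a real conversion would need to be checked against the actual contents of NUSI.txt.

```python
import re

HEADER_LINES = 5  # header length stated in the current Clause 25 text


def convert_nusi(old_lines):
    """Rewrite old-format NUSI lines into the suggested format."""
    new_lines = []
    for index, line in enumerate(old_lines):
        line = line.rstrip("\n")
        if index < HEADER_LINES:
            # Mark each header line explicitly as a comment line.
            new_lines.append(f"# {line}" if line else "#")
            continue
        if not line.strip():
            new_lines.append("")
            continue
        # Old order: USI, TAB, name.
        usi, name = line.split("\t", 1)
        # Keep only the hexadecimal digits of each UID, dropping brackets,
        # commas, and any "U+" prefix.
        code_points = re.findall(r"[0-9A-Fa-f]{4,6}", usi)
        sequence = " ".join(cp.upper() for cp in code_points)
        # New order: name, SEMICOLON, space-delimited USI.
        new_lines.append(f"{name};{sequence}")
    return new_lines


if __name__ == "__main__":
    old = [
        "header line 1", "header line 2", "header line 3",
        "header line 4", "header line 5",
        "<0069, 0307, 0301>\tLATIN SMALL LETTER I WITH DOT ABOVE AND ACUTE",
    ]
    print("\n".join(convert_nusi(old)))
```

The output uses the unbracketed, space-delimited form of the USI, which matches the form shown for data files in Example 2.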