INCITS/L2 response to Keown letter

L2/03-344

17 September 2003

To: Dave Michael, Chairman of the INCITS Standards Policy Board

CC: Jennifer Garner, Associate Director, Standards Programs, INCITS

Reference: Letter from Elaine Keown to ANSI

Dear Dave,

Thank you for forwarding Elaine Keown’s letter of 5 August 2003.

Ms. Keown raises two major concerns: the procedure by which characters are encoded in ISO/IEC 10646, and the appropriateness of the stakeholders involved in the encoding process. I’d like to clarify a few points that Ms. Keown may not be aware of. I hope this will address both concerns to everyone’s satisfaction.

1. General procedure.

INCITS/L2 (and the Unicode Technical Committee or UTC) strives to have an open yet rigorous procedure for character encoding. It is our goal to serve the various linguistic and cultural communities with an appropriate character repertoire in ISO/IEC 10646; however, there is a process by which these repertoires are developed, both at the national level (L2) and at the international level (SC2/WG2). All are welcome to contribute, provided they follow these procedures.

This well-documented process for encoding characters is available via the Unicode website (http://www.unicode.org/pending/proposals.html) and the SC2/WG2 website (http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/projects), specifically document WG2 N1502. This process is in place to ensure technical and linguistic continuity with the rest of the standard, and has been documented after years of experience working with proposals from numerous communities.

To date, neither SC2/WG2 nor L2 has received an encoding proposal or contribution from Ms. Keown. She did communicate with Arnold Winkler, former L2 chair, participated occasionally on the Unicode mail list, and presented a paper on Hebrew at the International Unicode Conference in Hong Kong in April 2001; however, none of this work meets the criteria for a character encoding proposal. Should Ms. Keown wish to formally submit a character proposal, L2 would be happy to consider her proposal.

2. Specific procedural issues with the Hebrew block.

Ms. Keown expressed several concerns about the construction and content of the Hebrew character block.

While developing the Hebrew repertoire, SC2/WG2 received contributions from Hebrew academicians and linguists. The initial Hebrew block was based on ISO/IEC 8859-8, and other characters have been added since then, following the character encoding procedure.

With regard to her other specific statements on Hebrew:

a. Coptic was moved. As Ms. Keown rightly comments, L2/UTC and SC2/WG2 policy is against moving characters once they are encoded (see http://www.unicode.org/standard/stability_policy.html). She then states that the Coptic block was moved. However, for Coptic, no characters were moved. Rather, 58 characters were added for Coptic at positions 2C80-2CBF (reference document WG2 N2611).

b. The Hebrew repertoire is not contiguous. This is not unique to Hebrew. The repertoires for Latin, Cyrillic and Khmer, for example, are broken into several non-contiguous blocks. The ideographs needed for Chinese, Japanese and Korean are also spread across multiple planes. Placement of the character repertoire in the standard, however, has no impact on software implementation. Future character additions will be allocated as appropriately as possible; however, there is no guarantee in the standard that characters of a particular writing system will be co-located.

c. Collation is broken by the repertoire. Unicode and ISO/IEC 10646 are encoding standards, not collation standards. The location of the characters in the repertoire does not determine or affect collation order for Hebrew or any other language or writing system; sorting is determined by the implementation. There are related standards that define collation over the repertoire of Unicode and 10646; however, they are not part of the encoding standard. Ms. Keown should review the Unicode Collation Algorithm (Unicode Technical Standard #10, http://www.unicode.org/reports/tr10/) and ISO/IEC 14651 (International String Ordering) for more information.

d. Hebrew subsets are poorly grouped. The current subsets in 10646 were developed based on input from user communities. There is a process by which new subsets can be defined. Again, SC2/WG2 has yet to receive a formal proposal from Ms. Keown, and welcomes any contributions concerning Hebrew subsetting.

e. Only 3 Hebrew script languages are partially covered. As noted earlier, there is a formal process for encoding characters. If Ms. Keown has knowledge of additional scripts needed for encoding Hebrew, we welcome her contributions.

f. The block is missing symbols needed for Leningrad and other critical codices. We welcome any contributions for outstanding characters, following procedural guidelines.

g. Some symbols are conflated and need semantic differentiation. We have yet to receive any formal proposal from Ms. Keown on the need to differentiate these symbols; again, a proposal that follows the submission guidelines is welcome.
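
As a brief illustration of points (b) and (c), the following minimal Python sketch sorts three Hebrew characters, two from the main Hebrew block and one from the Alphabetic Presentation Forms block, first by code point and then by a separate sort key. The function toy_sort_key is a hypothetical stand-in that uses only canonical decomposition; it is not the Unicode Collation Algorithm or ISO/IEC 14651, and a real implementation would rely on a collation library.

```python
import unicodedata

# Two characters from the main Hebrew block (U+0590..U+05FF) and one from
# the Alphabetic Presentation Forms block (U+FB00..U+FB4F).
SHIN = "\u05E9"                # HEBREW LETTER SHIN
TAV = "\u05EA"                 # HEBREW LETTER TAV
SHIN_WITH_SHIN_DOT = "\uFB2A"  # HEBREW LETTER SHIN WITH SHIN DOT

chars = [TAV, SHIN_WITH_SHIN_DOT, SHIN]


def toy_sort_key(text):
    # Hypothetical stand-in for a collation key (an assumption made for
    # this illustration): canonical decomposition maps U+FB2A to
    # U+05E9 U+05C1, so it compares next to SHIN rather than after TAV.
    return unicodedata.normalize("NFD", text)


# Plain code point order follows the position of each character in the
# standard, so the presentation form sorts last.
by_code_point = sorted(chars)

# Ordering through a separate key is independent of block placement.
by_toy_key = sorted(chars, key=toy_sort_key)

for label, ordering in (("code point", by_code_point), ("toy key", by_toy_key)):
    print(label, [unicodedata.name(c) for c in ordering])
```

The two orderings differ even though all three characters belong to the Hebrew script, which is the sense in which sorting is determined by the implementation rather than by where characters sit in the repertoire.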

3. Participation of stakeholders, quality of participants.

Ms. Keown raised a concern about the decision makers in the character encoding process at the national level. She may find the following interesting:

There are a number of individuals involved in the character encoding process in L2 and SC2/WG2 who have formal training and advanced degrees in linguistics or language-related fields. Their specialties include East Asian linguistics, indigenous languages, archaic scripts and language policy. This formal knowledge, combined with the committee’s expertise in internationalization and character sets, provides a balance of cultural correctness and technical viability appropriate to an international character encoding standard.

INCITS/L2 works closely with academia to develop new character encoding proposals. For example, in the last year, academics from the University of California/Berkeley, the University of Washington, and the University of California/Davis have worked with L2/UTC to develop character encoding proposals. This partnership has resulted in several character repertoires being accepted into ISO/IEC 10646.

The University of California/Berkeley Department of Linguistics has developed the Script Encoding Initiative (SEI). SEI was created to help Unicode encode lesser-known scripts that may not have the financial backing of the software industry. For L2/UTC, it is important that these scripts and the cultures they represent not be left on the wrong side of the digital divide. Ideally, all scripts, and with them all languages and cultures, will someday be included in ISO/IEC 10646.

I hope that it is clear from the above that INCITS/L2 engages in a character encoding process that is open to all interested stakeholders. In addition, this process is rigorous enough to meet the linguistic and cultural criteria of a community and provide an interoperable, international standard that works for global software.

Please feel free to contact me should there be any questions or comments.

With best regards,

Cathy Wissink

Chair, INCITS/L2 (Character Sets and Internationalization)