L2/04-035

Subject: Comments on draft UTS BOCU-1
Source: Sandra Martin O'Donnell, HP
Date: January 27, 2004

I have reviewed the draft UTS BOCU-1: MIME-Compatible Unicode Compression, and I have a number of questions and concerns.

The draft describes Binary Ordered Compression for Unicode (BOCU). The impetus for this, according to the draft, is to address the complaint that UTF-8 consumes too much space per character for many Asian characters compared with national encodings (e.g., most kanji are encoded in two bytes in Japanese EUC or Shift-JIS, but three bytes in UTF-8). BOCU-1 is in addition to the long-standing SCSU (Standard Compression Scheme for Unicode). The draft says that SCSU has not been widely implemented, and that it is incompatible with MIME and so can't be used in text email.

This draft actually describes two things: a generic BOCU, and the more specific BOCU-1. Tracing back through earlier versions of the existing Unicode Technical Note, it appears that generic BOCU was developed first, and that BOCU-1 was added because the generic form did not address the MIME issue. BOCU-1 is first described in UTN #6 dated 2002-08-09, and generic BOCU is described in a "Working Draft" dated 2001-05-30. Markus Scherer and Mark Davis are the co-authors of all documents relating to BOCU.

Regardless of the reasons for there being both generic BOCU and BOCU-1, I am quite concerned about creating more forms of Unicode, and I'd like the UTC to consider this issue carefully. This is proposed as a Unicode Technical Standard, so it's more official than just a Note or even a UTR (Unicode Technical Report). Unless there is compelling evidence that this form is required to convince Asian sites to adopt Unicode, I do not think we should adopt this as a UTS or UTR. Since generic BOCU has existed for almost three years, and BOCU-1 has been described since mid-2002, Asian sites have had enough time to implement one or both forms, or to express strong interest in seeing them implemented.

BOCU-1 has been implemented in ICU (International Components for Unicode), but is there any information about other implementations? Specifically, are there implementations in Asia, or by companies or organizations that do significant business in Asia? If BOCU-1 has been implemented, is there information about its use?

It's not clear to me that the Consortium needs to solve the space problem associated with UTF-8. Yes, we have heard the complaint that UTF-8 consumes more space for Asian characters than do native encodings, but do we have information to show that file size remains an important enough issue that it is a significant hurdle to Unicode acceptance? There are always more forms we can add to solve special classes of problems, but each adds to the cost of implementing the standard. I am not convinced that the benefits associated with BOCU-1 outweigh the complexity it adds for Unicode implementers.

If we do approve this as a UTS, what does that do to the status of SCSU? Do we really need two Unicode compression algorithms, even though BOCU focuses on UTF-8 and so is not exactly equivalent to SCSU? (Perhaps SCSU should be deprecated, since this draft notes that it has not been widely implemented, but I don't know what compatibility issues that raises.)

What about guidance on whether generic BOCU should ever be used? The draft describes generic BOCU briefly, and then ignores it for the rest of the document, which probably means it is to be considered obsolete. But I'm guessing at that.

At the very least, the draft should clarify the status of BOCU-1 with respect to SCSU and generic BOCU. But I need information about the compelling case for adding Yet Another Unicode Form before I can support this as a UTS.