L2/04-035

Subject: Comments on draft UTS BOCU-1
Source: Sandra Martin O'Donnell, HP
Date: January 27, 2004

I have reviewed the draft UTS BOCU-1: MIME-Compatible Unicode
Compression, and I have a number of questions and concerns. The draft
describes Binary Ordered Compression for Unicode (BOCU). The impetus for
this, according to the draft, is to address the complaint that UTF-8
consumes too much space per character for many Asian scripts compared
to national encodings (e.g., most kanji are encoded in two bytes for
Japanese EUC or Shift-JIS, but three bytes in UTF-8).
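
As a rough illustration of that size difference, the following minimal
sketch compares encoded sizes using Python's built-in codecs. The sample
string, the helper name bytes_per_char, and the codec list are
assumptions chosen for illustration; none of them come from the draft.

```python
# Illustrative sketch only: the sample text and codec list are assumptions,
# not taken from the BOCU-1 draft.
SAMPLE = "\u6f22\u5b57"  # two common kanji (the word "kanji" itself)

CODECS = ("utf-8", "euc_jp", "shift_jis")


def bytes_per_char(text, codec):
    """Return the average number of encoded bytes per character."""
    return len(text.encode(codec)) / len(text)


for codec in CODECS:
    print(f"{codec}: {bytes_per_char(SAMPLE, codec)} bytes per character")
```

For this sample, UTF-8 needs three bytes per character while EUC-JP and
Shift-JIS each need two. Other text, such as ASCII or half-width
katakana, will give different ratios.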

BOCU-1 is in addition to the long-existing SCSU (Standard Compression
Scheme for Unicode). The draft says that SCSU has not been widely
implemented, and that it is incompatible with MIME and so can't be used
in text email. 

This draft actually describes two things: a generic BOCU, and the more
specific BOCU-1. Tracing back through earlier versions of the existing
Unicode Technical Note, it appears the generic BOCU was developed first,
and that BOCU-1 was added because the generic did not address the MIME
issue. BOCU-1 is first described in UTN #6 dated 2002-08-09, and generic
BOCU is described in a "Working Draft" dated 2001-05-30. Markus Scherer
and Mark Davis are the co-authors of all documents relating to BOCU.

Regardless of the reasons for there being generic BOCU and BOCU-1, I am
quite concerned about creating more forms of Unicode, and I'd like the
UTC to consider this issue carefully. This is proposed as a Unicode
Technical Standard, so it's more official than just a Note or even a UTR
(Unicode Technical Report). Unless there is compelling evidence that
this form is required to convince Asian sites to adopt Unicode, I do not
think we should adopt this as a UTS or UTR. 

Generic BOCU has existed for almost three years, and BOCU-1 has been
described since mid-2002, so Asian sites have had enough time to
implement one or both forms -- or to express strong interest in seeing
them implemented. BOCU-1 has been implemented in ICU (International
Components for Unicode), but is there any information about other
implementations, specifically in Asia or by companies/organizations
that do significant business in Asia?

If BOCU-1 has been implemented, is there information about its use? It's
not clear to me that the Consortium needs to solve the space problem
associated with UTF-8. Yes, we have heard the complaint that UTF-8
consumes more space for Asian characters than do native encodings, but
do we have information to show that file size remains an important
enough issue that this is a significant hurdle in Unicode acceptance?
There are always more forms we can add to solve special classes of
problems, but each adds to the cost of implementing the standard. I am
not convinced that the benefits associated with BOCU-1 outweigh the
complexity it adds for Unicode implementers.

If we do approve this as a UTS, what does that do to the status of SCSU?
Do we really need two Unicode compression algorithms, even though BOCU
focuses on UTF-8 and so is not exactly equivalent to SCSU? (Perhaps SCSU
should be deprecated, since this draft notes that it has not been widely
implemented, but I don't know what compatibility issues that raises.) 

What about guidance with respect to whether generic BOCU should ever be
used? The draft describes generic BOCU briefly, and then ignores it for
the rest of the document, which probably means it is to be considered
obsolete. But I'm guessing at that.

At the very least, the draft should clarify BOCU-1 with respect to SCSU
and generic BOCU. But I need information about the compelling case for
adding Yet Another Unicode Form before I can support this as a UTS.