Re: Pre-proposal for SCSU updates

From: Markus Scherer
Date: Mon Nov 01 2010 - 16:55:21 CST


    Hi Doug,

    On Mon, Nov 1, 2010 at 2:50 PM, Doug Ewell wrote:

    > I'd like to try to gauge the community's interest, if any, in some
    > possible updates to UTS #6 and the SCSU mechanism, as follows:

    Personally: I like SCSU, but I think it has basically failed. It is
    discouraged for HTML; general-purpose compression is usually better and is
    already available in many open protocols; and in closed systems one can
    often come up with a simpler and more compact custom encoding.

    As the maintainer of UTS #6: I am open to extending the spec, but we have to
    make sure that we version it correctly. The IANA charset "SCSU" must mean
    the current version because existing readers of that charset will fail with
    data for the new version. The UTS #6 text will have to prominently highlight
    features that are not supported by the "SCSU" charset.

    From what I saw some time ago on the ietf-charsets list, we might not get a
    new charset name registered for a new SCSU version. (Because it's not
    documenting existing practice, and not a very useful new charset.)

    > (1) Updating the spec to add dynamic-window offsets 0xA8 through 0xBF,
    > to permit encoding the blocks from U+A000 through U+ABFF in single-byte
    > mode. This would allow the many small alphabets assigned to this range,
    > such as Bamum and Syloti Nagri and Phags-Pa, to be encoded efficiently
    > using SCSU. Other offsets could be added as well, such as for Hangul
    > Jamo Extended-B.

    Possible, but you wouldn't want to include the larger scripts in the
    A000..ABFF area (Yi & Vai) because they don't benefit from SCSU windows.
    Also, the several "Xyz Extended-N" blocks probably don't benefit from SCSU
    windows either; they are probably fine with SQU/UQU.

    It seems like even the small scripts there are used so rarely that I
    question the value of a new SCSU version.
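
For reference, here is a rough sketch of what item (1) would mean for the
window-offset byte. The first two ranges and the fixed offsets follow the
offset table in UTS #6. The 0xA8..0xBF branch is only my reading of the
proposal (a linear mapping onto U+A000..U+ABFF in steps of 0x80); it is an
assumption and is not part of any published spec. The names `window_offset`
and `extended` are made up for illustration.

```python
# Fixed window offsets defined by UTS #6 for offset bytes 0xF9..0xFF.
FIXED_OFFSETS = {
    0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
    0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60,
}


def window_offset(x, extended=False):
    """Map an SCSU window-offset byte to the start of its 128-character window.

    Returns None for bytes that are reserved.
    """
    if 0x01 <= x <= 0x67:
        return x * 0x80                 # U+0080 .. U+3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00        # U+E000 .. U+FF80
    if extended and 0xA8 <= x <= 0xBF:
        # ASSUMPTION: hypothetical mapping for the proposed extension,
        # 24 offset bytes covering U+A000 .. U+ABFF.
        return 0xA000 + (x - 0xA8) * 0x80
    if x in FIXED_OFFSETS:
        return FIXED_OFFSETS[x]
    return None                         # reserved in the current spec


# A current decoder sees 0xB8 as reserved; under the assumed mapping it
# would select the window at U+A800, which covers Syloti Nagri and Phags-pa.
print(window_offset(0xB8))                       # None
print(hex(window_offset(0xB8, extended=True)))   # 0xa800
```

This also illustrates the versioning problem: a reader implementing today's
"SCSU" charset has no defined behavior for offset bytes in the 0xA8..0xBF
range.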

    > (2) Updating the spec to assign "reserved" tag bytes 0x0C (single-byte
    > mode) and 0xF2 (Unicode mode) as "reset all" commands, similar to 0xFF
    > in BOCU-1. This would allow more efficient encoding in some cases, as
    > well as providing a possible synchronization mechanism for decoders. As
    > an alternative, these unused tag bytes could be released for normal,
    > non-reserved use, so they would no longer require escaping.

    Either way is ok in principle, but again I wonder if their minor benefits
    are worth a new SCSU version.

    > (3) Providing an informational section in UTS #6 on "line-safe SCSU," a
    > special-purpose SCSU encoding profile in which all state is returned to
    > the default at the end of each line, and all lines are terminated with
    > CR/LF.

    With no change to the spec, this seems fairly easy.

    > I'm aware that many people have been discouraging the use of SCSU
    > altogether, on the basis of Web-page security concerns or the reputation
    > of SCSU as "difficult to implement." These people will not be affected
    > one way or another by any enhancements to SCSU, and I am not focusing on
    > them at present.

    That's fine, but updating a public spec takes a bit of time and work, and if
    there is only a tiny user community for SCSU, it might not be worth it.
    I am not against updating UTS #6, but please consider the costs and

