Re: UTS#40 (BOCU-1) special handling of large blocks

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Feb 04 2007 - 16:05:38 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > BOCU-1 contains some provisions in its encoding for handling some
    > large blocks so that the smallest codes will be assigned to the
    > codepoints in the middle of these large blocks.
    >
    > However, I wonder if this is not a bit arbitrary, because these codes
    > are likely not the most frequently used ones.

    I'm sure it is arbitrary, in the sense that every "exceptional" large
    block adds some complexity to the algorithm, and the simplicity of
    BOCU-1 compared to SCSU was supposed to be one of its selling points.
    CJK and Hangul are likely to be the most commonly used large blocks, by
    a wide margin.

    > For example, for the Hangul Syllables block, it seems that there's a
    > higher frequency for CV syllables than for CVC syllables, and within
    > both of them there's a higher frequency for syllables starting with a
    > null IEUNG consonant; this has the effect that the average codepoint
    > value is shifted down, and that the most frequent codepoints should be
    > accessible with the smallest differences from anywhere in the block.
    > Unfortunately, the BOCU-1 design assumes that the most frequent codes
    > are in the middle of the block, and this is not the case here. So I
    > wonder if the arbitrary constants chosen to store the current state in
    > the Hangul block are appropriate. My opinion is that this state should
    > be nearer to the subset of codepoints starting with a null ieung
    > leading consonant jamo.

    I think the goal was to ensure that the entire block was accessible via
    2-byte sequences. Korean text tends to jump all over the Hangul
    Syllables block, and since single bytes only cover differences of -0x40
    to +0x3F (-64 to +63) code points, it's unlikely many of them could be
    employed anyway.
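
    The arithmetic behind this can be checked with a short sketch. It
    assumes the single-byte and two-byte difference ranges and the fixed
    Hangul state (the midpoint of AC00..D7A3) as they are usually quoted
    from the BOCU-1 description; the function names are made up for
    illustration, and the constants should be verified against the spec
    before relying on them.

```python
# Sketch only: the constants below are assumptions taken from the usual
# description of BOCU-1, not copied from a normative source here.
REACH_NEG_1, REACH_POS_1 = -64, 63          # differences that fit in 1 byte
REACH_NEG_2, REACH_POS_2 = -10513, 10512    # differences that fit in 2 bytes

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
# Assumed fixed "previous" state after any Hangul syllable: the block midpoint.
HANGUL_MIDDLE = (HANGUL_FIRST + HANGUL_LAST) // 2


def bytes_needed(diff):
    """Hypothetical helper: 1, 2, or 3 (meaning "three or more") bytes."""
    if REACH_NEG_1 <= diff <= REACH_POS_1:
        return 1
    if REACH_NEG_2 <= diff <= REACH_POS_2:
        return 2
    return 3


def ieung_range():
    """Code point range of syllables whose leading consonant is IEUNG.

    Uses the standard Hangul syllable arithmetic:
    syllable = 0xAC00 + (L * 21 + V) * 28 + T, with L = 11 for U+110B.
    """
    l_index = 0x110B - 0x1100
    first = HANGUL_FIRST + l_index * 21 * 28
    return first, first + 21 * 28 - 1


if __name__ == "__main__":
    sizes = [bytes_needed(cp - HANGUL_MIDDLE)
             for cp in range(HANGUL_FIRST, HANGUL_LAST + 1)]
    print("state:", hex(HANGUL_MIDDLE))
    print("syllables reachable in 1 byte:", sizes.count(1))
    print("syllables reachable in 2 bytes:", sizes.count(2))
    lo, hi = ieung_range()
    print("IEUNG-initial syllables:", hex(lo), "to", hex(hi),
          "need", bytes_needed(lo - HANGUL_MIDDLE), "bytes")
```

    Under these assumptions, every syllable in the block is within
    two-byte reach of the fixed state, while only 128 of the 11,172
    syllables are within single-byte reach wherever the state is placed.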

    > Other large blocks (more than 128 codepoints) have been forgotten:
    > ...
    > * FB50..FDFF; Arabic Presentation Forms-A

    This was probably intentional, since use of those characters is frowned
    upon anyway.

    > For the following block, I don't think we can define a good
    > statistical model:
    > * 12000..123FF; Cuneiform
    > so the default is reasonable.
    > ...

    More importantly, Cuneiform was just recently added, and there is no way
    the BOCU-1 spec is going to be updated periodically to add new rules.

    In UTN #14 I asked to have the SCSU spec updated to remove the "special
    case" status from U+000C (FORM FEED), since it is not used as a tag but
    still must be escaped with SQ0. The byte 0C is listed as "reserved for
    future use," which made me think such a request was reasonable, and in
    any case encoders and decoders could have been made tolerant of either
    the escaped or unescaped encoding. Nevertheless, I was basically told
    that the SCSU spec is fixed and no such proposal would be entertained.
    If this is the case, then certainly a "breaking" change like the
    re-encoding of Yi or Cuneiform text would be out of the question.
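
    To make the form feed case concrete, here is a minimal sketch of
    SCSU single-byte mode restricted to U+0000..U+007F. It assumes that
    SQ0 is the byte 01 and that only NUL, TAB, LF, and CR among the C0
    controls pass through unescaped. The function names and the
    tolerate_bare_ff flag are hypothetical, and the flag models the
    tolerant decoder described above, not anything in the SCSU spec.

```python
SQ0 = 0x01                               # "quote from window 0" tag
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}  # C0 controls that need no escape


def encode_ascii_range(text):
    """Encode U+0000..U+007F text; other C0 controls get an SQ0 prefix."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            raise ValueError("sketch handles U+0000..U+007F only")
        if cp < 0x20 and cp not in PASS_THROUGH:
            out.append(SQ0)              # e.g. U+000C becomes 01 0C
        out.append(cp)
    return bytes(out)


def decode_ascii_range(data, tolerate_bare_ff=False):
    """Decode the output above; optionally accept an unescaped 0C byte."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SQ0:
            if i + 1 >= len(data) or data[i + 1] > 0x7F:
                raise ValueError("sketch handles SQ0 + 00..7F only")
            out.append(chr(data[i + 1]))
            i += 2
        elif 0x20 <= b <= 0x7F or b in PASS_THROUGH:
            out.append(chr(b))
            i += 1
        elif b == 0x0C and tolerate_bare_ff:
            out.append("\x0c")           # hypothetical tolerant behaviour
            i += 1
        else:
            raise ValueError("unsupported byte 0x%02X in this sketch" % b)
    return "".join(out)


if __name__ == "__main__":
    print(encode_ascii_range("a\x0cb").hex())
    print(repr(decode_ascii_range(b"a\x0cb", tolerate_bare_ff=True)))
```

    A strict decoder rejects the bare 0C byte, and a tolerant one reads
    both forms as U+000C, so existing escaped data would decode the same
    way either way.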

    > In fact, the BOCU algorithm (at least in the BOCU-1 profile) is
    > possibly suboptimal...

    This section involves revisiting the entire structural basis of BOCU,
    and I won't dare touch it.

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

