From: Doug Ewell (dewell@adelphia.net)
Date: Sun Feb 04 2007 - 16:05:38 CST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> BOCU-1 contains some provisions in its encoding for handling some
> large blocks so that the smallest codes will be assigned to the
> codepoints in the middle of these large blocks.
>
> However, I wonder if this is not a bit arbitrary, because these codes
> are likely not the most frequently used ones.
I'm sure it is arbitrary, in the sense that every "exceptional" large
block adds some complexity to the algorithm, and the simplicity of
BOCU-1 compared to SCSU was supposed to be one of its selling points.
CJK and Hangul are likely to be the most commonly used large blocks, by
a wide margin.
> For example, for the Hangul Syllables block, it seems that there's a
> higher frequency for CV syllables than for CVC syllables, and within
> both of them there's a higher frequency for syllables starting with a
> null IEUNG consonant; this has the effect that the average codepoint
> value is shifted down, and that the most frequent codepoints should be
> accessible with the smallest differences from anywhere in the block.
> Unfortunately, the BOCU-1 design assumes that the most frequent codes
> are in the middle of the block, and this is not the case here. So I
> wonder if the arbitrary constants chosen to store the current state in
> the Hangul block are appropriate. My opinion is that this state should
> be nearer to the subset of codepoints starting with a null ieung
> leading consonant jamo.
I think the goal was to ensure that the entire block was accessible via
2-byte sequences. Korean text tends to jump all over the Hangul
Syllables block, and since single bytes only cover differences of -0x40
to +0x3F (-64 to +63) code points, it's unlikely many of them could be
employed anyway.
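
As a rough check of that reasoning, here is a small Python sketch. The reach constants and the rule for picking the "previous" state are written from my recollection of the BOCU-1 description in UTN #6 and should be treated as assumptions, not as the normative algorithm. The function names and the leading-consonant index for ieung (11, from the standard Hangul syllable arithmetic) are my own choices for illustration. The sketch ignores the special handling of space and control characters.

```python
# Difference ranges per encoded length, as recalled from UTN #6 (assumption).
REACH_NEG_1, REACH_POS_1 = -64, 63
REACH_NEG_2, REACH_POS_2 = -10513, 10512
REACH_NEG_3, REACH_POS_3 = -187660, 187659

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def bocu1_prev(c):
    """State value used as the base for the next difference (assumed rules)."""
    if 0x3040 <= c <= 0x309F:            # Hiragana: middle of the block
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:            # CJK: whole block within 2-byte reach
        return 0x4E00 - REACH_NEG_2
    if HANGUL_FIRST <= c <= HANGUL_LAST:  # Hangul: middle of the block
        return (HANGUL_LAST + HANGUL_FIRST) // 2
    return (c & ~0x7F) + 0x40            # default: middle of a 128-code row


def delta_length(delta):
    """Number of bytes needed to encode a given difference."""
    if REACH_NEG_1 <= delta <= REACH_POS_1:
        return 1
    if REACH_NEG_2 <= delta <= REACH_POS_2:
        return 2
    if REACH_NEG_3 <= delta <= REACH_POS_3:
        return 3
    return 4


def hangul_lengths():
    """Set of encoded lengths needed to reach any syllable from the state."""
    center = bocu1_prev(HANGUL_FIRST)
    return {delta_length(c - center)
            for c in range(HANGUL_FIRST, HANGUL_LAST + 1)}


def ieung_initial_range():
    """First and last syllables whose leading consonant is ieung."""
    ieung_index = 11                      # position of ieung among the 19 leads
    first = HANGUL_FIRST + ieung_index * 21 * 28
    return first, first + 21 * 28 - 1


if __name__ == "__main__":
    center = bocu1_prev(HANGUL_FIRST)
    first, last = ieung_initial_range()
    print(hex(center), sorted(hangul_lengths()))
    print(hex(first), hex(last), first - center, last - center)
```

Under these assumptions every syllable is reachable in at most two bytes from the Hangul state, and the ieung-initial syllables sit several hundred code points away from that state, well outside the single-byte range. Moving the state closer to them would help only for syllables within 64 code points of the new value.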
> Other large blocks (more than 128 codepoints) have been forgotten:
> ...
> * FB50..FDFF; Arabic Presentation Forms-A
This was probably intentional, since use of those characters is frowned
upon anyway.
> For the following block, I don't think we can define a good
> statistical model:
> * 12000..123FF; Cuneiform
> so the default is reasonable.
> ...
More importantly, Cuneiform was just recently added, and there is no way
the BOCU-1 spec is going to be updated periodically to add new rules.
In UTN #14 I asked to have the SCSU spec updated to remove the "special
case" status from U+000C (FORM FEED), since it is not used as a tag but
still must be escaped with SQ0. The byte 0C is listed as "reserved for
future use," which made me think such a request was reasonable, and in
any case encoders and decoders could have been made tolerant of either
the escaped or unescaped encoding. Nevertheless, I was basically told
that the SCSU spec is fixed and no such proposal would be entertained.
If this is the case, then certainly a "breaking" change like the
re-encoding of Yi or Cuneiform text would be out of the question.
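
To make the FORM FEED case concrete, here is a minimal Python sketch of how code points below U+0080 are written in SCSU single-byte mode, as I understand the spec: NUL, TAB, LF, CR and the range 0x20..0x7F pass through unchanged, and any other C0 control has to be quoted with SQ0 (byte 0x01). The function name is hypothetical, and this covers only the ASCII range, not windows or Unicode mode.

```python
SQ0 = 0x01                                   # "quote from window 0" tag
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def scsu_encode_below_0x80(cp):
    """Encode one code point below U+0080 in SCSU single-byte mode."""
    if not 0 <= cp < 0x80:
        raise ValueError("this sketch only handles U+0000..U+007F")
    if cp >= 0x20 or cp in PASS_THROUGH_CONTROLS:
        return bytes([cp])                   # byte stands for itself
    # Every other C0 control, including U+000C, must be quoted, even
    # though byte 0x0C is reserved and is not used as a tag.
    return bytes([SQ0, cp])


if __name__ == "__main__":
    print(scsu_encode_below_0x80(0x0C).hex())
    print(scsu_encode_below_0x80(0x0A).hex())
```

A decoder that also accepted a bare 0x0C as FORM FEED would be the "tolerant" behavior mentioned above; the sketch shows only the escaped form.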
> In fact, the BOCU algorithm (at least in the BOCU-1 profile) is
> possibly suboptimal...
This section involves revisiting the entire structural basis of BOCU,
and I won't dare touch it.
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages