Re: UTS#40 (BOCU-1) special handling of large blocks

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 03 2007 - 10:17:14 CST

Next message: Philippe Verdy: "Re: New translation posted"

Previous message: Jukka K. Korpela: "Re: New translation posted"
In reply to: Philippe Verdy: "Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM"
Next in thread: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Reply: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

BOCU-1 contains some provisions in its encoding for handling large some blocks so that the smallest codes will be assigned to the codepoints in the middle of these large blocks.

However I wonder if this is not a bit arbitrary, because these codes could like not be the most frequently used ones.

For example, for the Hangul syllables blocks, it seems that there's a higher frequency for CV syllables than for CVC syllables, and within both of them there's a higher frequency for syllables starting with a null IEUNG consonnant; this has the effect that the average codepoint value is shifted down, and that the most frequent codepoints should be accessible with the smallest differences from anywhere in the block. unfortunately the BOCU-1 design assumes that the most frequent codes are in the middle of the block, and this is not the case here. So I wonder if the arbitrary constants chosen to store the current state in the Hangul block is appropriate. My opinion is that this state should be nearer from the subset of codepoints starting by a null ieung leading consonnant jamo.

Other large blocks (more than 128 codepoints) have been forgotten:
* A000..A48F; Yi Syllables
* A490..A4CF; Yi Radicals
* F900..FAFF; CJK Compatibility Ideographs
* FB50..FDFF; Arabic Presentation Forms-A
These are handled by splitting them into multiple independant 128-code subranges, each with their own central codepoint chosen arbitrarily to represent the previous state value (this assumes that the next codepoint has greater chances to fall nearer that state codepoint value):
* Given the possible extension range of last trailing bytes (between -121 and +122), or single bytes (between -64 and +63), the Yi blocks may be probably better positioned. I don't have statistics to perform a test of the distribution of codepoint differences.
* For CJK Compatibility Ideographs, they have different subranges and their use is quite exceptional, except for a small subrange which seems to have character-to-character coherence.

The following blocks have similar problems, but generally, surrogate codepoints should be absent from encoded plain-text files (although the spec does not forbid encoding them, or other non-character codepoints):
* D800..DB7F; High Surrogates
* DB80..DBFF; High Private Use Surrogates
* DC00..DFFF; Low Surrogates

For the following:
* E000..F8FF; Private Use Area
the default behavior is reasonnable, it assumes that private uses will attempt to group their assignments into coherent subblocks with similar functions so that if multiple codepoints in this block occur, there's a greater chance that they will have more chance to have smaller differences, and in many cases, only a small subset of the range will be used.

For the following block, I don't think we can define a good statistical model:
* 12000..123FF; Cuneiform
so the default is reasonnable.

For the supplementary large blocks in plane 2:
* 20000..2A6DF; CJK Unified Ideographs Extension B
* 2F800..2FA1F; CJK Compatibility Ideographs Supplement
The default behavior is also reasonnable, as they will most often occur as isolated islands within an ocean of common CJK Han ideographs. But then, is it really useful to change the current state? Couldn't these use the same state as basic Han ideographs? This would save one "big jump" per occurence of these characters.

Finally, for the following:
* 1D400..1D7FF; Mathematical Alphanumeric Symbols
the default seems reasonnable, as the alphabets are grouped, and anyway, these letter-like math symbols often occur in isolation, separated by spaces and operators in lots of other possible blocks...

In fact, the BOCU algorithm (at least in the BOCU-1 profile) is possibly suboptimal: the assignment of lead bytes and multibyte length should depend on the current block (represented by the "prev" state variable), whilst still preserving the binary ordering. The various lead bytes need not to be grouped by multibyte length (i.e. so that longer sequences will necessarily be encoded for larger absolute differences): this would be optimal if the distribution of differences is gaussian-like (or more exactly, monotonic and falling on each side of each mean codepoint stored in the "prev" state variable).

However, if there are pikes on each side, the simple heuristic used in BOCU-1 is suboptimal, because the differences found in those pikes will have longer encoding length than necessary.

One common example of this suboptimal distribution occurs with frequent shifts into (and immediately after out from) general punctuation blocks from all european script blocks (and even from most modern Asian texts.) The same is true for general diacritics:
* once can solve one half of the problem by NOT changing the current state into these blocks, so that small differences will remain encodable for letters after these single occurences.
* the other optimization is to estimate the average distance between those frequent punctuations blocks or diacritic blocks and frequent letters on the correctponding scripts encoded in those blocks; these differences need not necessarily be encoded with longer sequences.

The important key of the BOCU compression effect is that the encoding lengths are predicted, not the fact that the encoded bytes for the same length are grouped together. Also lead bytes for long sequences (4 bytes) need not be fully used before using the next lead byte, notably if the next lead byte would better be used to lead shorter sequences associated to more frequent differences:

Once you have determined the optimal encoding length for each range of differences, you can build the effective encoding table of lead bytes.

Note also that the same lead byte could be used to encode sequences with different lengths, depending on the value of the second trailing byte...

Next message: Philippe Verdy: "Re: New translation posted"
Previous message: Jukka K. Korpela: "Re: New translation posted"
In reply to: Philippe Verdy: "Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM"
Next in thread: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Reply: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sat Feb 03 2007 - 10:19:26 CST