Re: UTS#40 (BOCU-1) special handling of large blocks

From: Markus Scherer (markus.icu@gmail.com)
Date: Wed Feb 07 2007 - 18:29:10 CST

Next message: Lokesh Joshi: "Query for Validity of Thai Sequence"

Previous message: Markus Scherer: "Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM"
In reply to: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Next in thread: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Reply: Doug Ewell: "Re: UTS#40 (BOCU-1) special handling of large blocks"

On 2/4/07, Doug Ewell <dewell@adelphia.net> wrote:
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> > BOCU-1 contains some provisions in its encoding for handling some
> > large blocks so that the smallest codes will be assigned to the
> > codepoints in the middle of these large blocks.
> >
> > However I wonder if this is not a bit arbitrary, because these codes
> > could likely not be the most frequently used ones.
>
> I'm sure it is arbitrary, in the sense that every "exceptional" large
> block adds some complexity to the algorithm,

Yes and no. Yes, it's arbitrary - when you design a charset or a data
structure, you design and optimize it for what you deem important or
interesting.

> and the simplicity of
> BOCU-1 compared to SCSU was supposed to be one of its selling points.

This played a role.

> CJK and Hangul are likely to be the most commonly used large blocks, by
> a wide margin.

This played a large role.

> I think the goal was to ensure that the entire block was accessible via
> 2-byte sequences.

This is the key: Without special handling for Unihan and Hangul,
BOCU-1 would use 3-byte sequences for many of their characters. This
is not a problem for Yi and other "large" blocks that have fewer than
10000 code points because they always stay within the range of 2-byte
deltas.
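
To make the reach argument concrete, here is a minimal Python sketch. The
constants and the previous-state rules are written from memory of the
published BOCU-1 sample code, so treat them as assumptions to check against
the spec; worst_case() is a hypothetical helper for this illustration only.

```python
# Delta reach per encoded length; values are assumptions recalled from the
# BOCU-1 sample code, not quoted from this thread.
TRAIL_COUNT = 243
REACH_POS_1, REACH_NEG_1 = 63, -64
REACH_POS_2 = REACH_POS_1 + 43 * TRAIL_COUNT               # 10512
REACH_NEG_2 = REACH_NEG_1 - 43 * TRAIL_COUNT               # -10513
REACH_POS_3 = REACH_POS_2 + 3 * TRAIL_COUNT * TRAIL_COUNT  # 187659
REACH_NEG_3 = REACH_NEG_2 - 3 * TRAIL_COUNT * TRAIL_COUNT  # -187660


def delta_length(delta):
    """Number of bytes needed to encode a code point difference."""
    if REACH_NEG_1 <= delta <= REACH_POS_1:
        return 1
    if REACH_NEG_2 <= delta <= REACH_POS_2:
        return 2
    if REACH_NEG_3 <= delta <= REACH_POS_3:
        return 3
    return 4


def simple_prev(c):
    """Default rule: the middle of the 128-aligned window containing c."""
    return (c & ~0x7F) + 0x40


def special_prev(c):
    """Previous-state rule with the large-block adjustments (assumed values)."""
    if 0x3040 <= c <= 0x309F:            # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:            # Unihan
        return 0x4E00 - REACH_NEG_2
    if 0xAC00 <= c <= 0xD7A3:            # Hangul
        return (0xD7A3 + 0xAC00) // 2
    return simple_prev(c)


def worst_case(first, last, prev_fn):
    """Longest delta for any pair of consecutive characters in one block."""
    prevs = {prev_fn(c) for c in range(first, last + 1)}
    # The length grows with the distance from prev, so the block ends suffice.
    return max(delta_length(n - p) for p in prevs for n in (first, last))


for name, first, last in [("Unihan", 0x4E00, 0x9FA5),
                          ("Hangul", 0xAC00, 0xD7A3),
                          ("Yi", 0xA000, 0xA4CF)]:
    print(name, worst_case(first, last, simple_prev),
          worst_case(first, last, special_prev))
```

Under these assumptions the worst case for Unihan and Hangul drops from 3
bytes to 2 with the adjustment, while Yi is at 2 bytes either way.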

A similar reason applies to Hiragana: The special adjustment keeps
this script, which is relatively common but not 128-aligned, within
single-byte deltas. This is probably the most debatable special
adjustment in BOCU-1.
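
A small Python sketch of the Hiragana case, assuming a single-byte delta
range of -64..63, a default previous state at the middle of the 128-aligned
window, and 0x3070 as the adjusted state (all three recalled from the sample
code, so verify them against the spec). The helper names are hypothetical.

```python
HIRAGANA = range(0x3040, 0x30A0)   # U+3040..U+309F, straddles two windows


def is_single_byte(delta):
    """True if the difference fits the assumed single-byte range."""
    return -64 <= delta <= 63


def simple_prev(c):
    """Default rule: the middle of the 128-aligned window containing c."""
    return (c & ~0x7F) + 0x40


def hiragana_prev(c):
    """Adjusted rule: one fixed state for the whole Hiragana block."""
    return 0x3070


def multi_byte_pairs(prev_fn):
    """Count pairs (a, b) of Hiragana where b after a needs 2+ bytes."""
    return sum(1 for a in HIRAGANA for b in HIRAGANA
               if not is_single_byte(b - prev_fn(a)))


print("default rule: ", multi_byte_pairs(simple_prev))
print("adjusted rule:", multi_byte_pairs(hiragana_prev))
```

With the fixed state every Hiragana-to-Hiragana step stays in one byte; with
the default rule a sizable share of the pairs spill into two bytes because
the block crosses a 128-code-point boundary.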

> ... I was basically told
> that the SCSU spec is fixed and no such proposal would be entertained.
> If this is the case, then certainly a "breaking" change like the
> re-encoding of Yi or Cuneiform text would be out of the question.

There is no reason to change the BOCU-1 encoding of Yi or Cuneiform
because it would not improve compression at all.

In general, if you make an incompatible change - a change where an old
decoder cannot cope with the output from an updated encoder - then you
must change the name of the charset.

Best regards,
markus


This archive was generated by hypermail 2.1.5 : Wed Feb 07 2007 - 18:31:48 CST