Re: UTS#40 (BOCU-1) special handling of large blocks

From: Doug Ewell ([email protected])
Date: Thu Feb 08 2007 - 08:35:41 CST

Next message: Frank Ellermann: "Re: UTS#40 (BOCU-1) special handling of large blocks"

Previous message: Lokesh Joshi: "Query for Validity of Thai Sequence"
In reply to: Markus Scherer: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Next in thread: Frank Ellermann: "Re: UTS#40 (BOCU-1) special handling of large blocks"
Reply: Frank Ellermann: "Re: UTS#40 (BOCU-1) special handling of large blocks"

Markus Scherer <markus dot icu at gmail dot com> wrote:

>> I'm sure it is arbitrary, in the sense that every "exceptional" large
>> block adds some complexity to the algorithm,
>
> Yes and no. Yes, it's arbitrary - when you design a charset or a data
> structure, you design and optimize it for what you deem important or
> interesting.

Absolutely correct. "Arbitrary" does not mean "capricious" here.

>> I think the goal was to ensure that the entire block was accessible
>> via 2-byte sequences.
>
> This is the key: Without special handling for Unihan and Hangul, it
> would use 3-byte sequences for many of the characters. This is not a
> problem for Yi and other "large" blocks that have fewer than 10000
> code points because they always stay within the range of 2-byte
> deltas.

I hadn't done the math, but of course you are right. It might be a good
idea to add this point somewhere in the BOCU-1 spec (UTN or UTS).
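For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The reach constants and the "middle of the 128-code-point row"
default are what I recall from the published BOCU-1 description, and the
centered state values (0x7711 for Unihan, 0xC1D1 for Hangul) likewise, so
treat all of them as assumptions to verify against the spec. The function
names are made up for illustration.

```python
# Assumed BOCU-1 delta reaches (verify against the spec before relying on them).
REACHES = ((-64, 63), (-10513, 10512), (-187660, 187659))

def delta_bytes(delta):
    """Bytes needed for a delta under the assumed reaches (4 if beyond all)."""
    for n, (lo, hi) in enumerate(REACHES, start=1):
        if lo <= delta <= hi:
            return n
    return 4

def generic_prev(c):
    # Assumed default rule: middle of the 128-code-point row containing c.
    return (c & ~0x7F) + 0x40

def worst_case_bytes(first, last, prev_fn):
    # The largest deltas come from jumping between the two ends of the block.
    return max(delta_bytes(last - prev_fn(first)),
               delta_bytes(first - prev_fn(last)))

# Unihan and Hangul with the default rule vs. a fixed mid-block state.
unihan_generic = worst_case_bytes(0x4E00, 0x9FA5, generic_prev)
unihan_center = worst_case_bytes(0x4E00, 0x9FA5, lambda c: 0x7711)
hangul_generic = worst_case_bytes(0xAC00, 0xD7A3, generic_prev)
hangul_center = worst_case_bytes(0xAC00, 0xD7A3, lambda c: 0xC1D1)
# Yi (syllables plus radicals) needs no special case.
yi_generic = worst_case_bytes(0xA000, 0xA4CF, generic_prev)
```

Under those assumptions, the default rule gives a 3-byte worst case for
Unihan and Hangul and 2 bytes for Yi, while the centered state keeps
Unihan and Hangul at 2 bytes.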

> In general, if you make an incompatible change - a change where an old
> decoder cannot cope with the output from an updated encoder - then you
> must change the name of the charset.

UTF-8 was initially defined to cover the entire original 31-bit
ISO 10646 code space, with sequences up to 6 bytes long, before Unicode
and 10646 agreed to limit the range to U+10FFFF. The definition of
UTF-8 was then narrowed to match (RFC 3629 allows at most 4 bytes), and
I've personally seen several decoders that recognized the longer
sequences, but AFAIK the name "UTF-8" was never changed or qualified
with a version number.
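To make the difference concrete, here is a minimal sketch of how a
decoder might classify lead bytes under the current definition versus the
original 31-bit one. The function name and the allow_legacy flag are
hypothetical, and this only looks at the lead byte; a real decoder must
also validate the trail bytes, overlong forms, and surrogates.

```python
def utf8_lead_length(lead, allow_legacy=False):
    """Sequence length implied by a UTF-8 lead byte, or None if invalid.

    allow_legacy=True also accepts the lead bytes of the original
    31-bit definition (illustrative flag, not from any real library).
    """
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:   # C0 and C1 could only start overlong forms
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:   # F4 still needs a trail-byte check for > U+10FFFF
        return 4
    if allow_legacy:
        if 0xF5 <= lead <= 0xF7:   # 4-byte forms beyond U+10FFFF
            return 4
        if 0xF8 <= lead <= 0xFB:
            return 5
        if 0xFC <= lead <= 0xFD:
            return 6
    return None                # trail bytes 80..BF, C0, C1, FE, FF, etc.
```

A decoder written against the old definition behaves like
allow_legacy=True; one written against the current definition rejects
F5 and above outright, yet both call the encoding "UTF-8".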

I do wonder, as I did months ago when we had this discussion, how many
SCSU decoders have been written that recognize a bare 0C as an error. I
realize this is a pragmatic view and not a pure one. I did write my
decoder to flag unquoted 0C as an error, but I could have chosen to
accept it since Section 4 (conformance clause C1) says "The action of a
conformant decoder on illegal or reserved input is undefined."
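The two choices can be sketched as a single policy switch. This is only
an illustration with a made-up function name, not code from any actual
SCSU decoder; it assumes the decoder is in single-byte mode and has just
read a 0C that was not preceded by a quote tag.

```python
def handle_bare_0c(strict=True):
    """Hypothetical policy hook for an unquoted 0C in SCSU single-byte mode."""
    if strict:
        # Strict choice: 0C is a reserved tag value, so report an error.
        raise ValueError("SCSU: reserved byte 0C in single-byte mode")
    # Lenient choice: pass it through as U+000C FORM FEED.
    return "\x0c"
```

Either behavior is allowed by the "undefined" wording quoted above; the
strict branch is the one that would notice if 0C were ever assigned a
meaning.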

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

