re: BOCU patent (was: Re: Medievalist ligature character in the PUA)

From: verdy_p (verdy_p@wanadoo.fr)
Date: Fri Dec 18 2009 - 07:25:31 CST

Next message: verdy_p: "re: BOCU patent (was: Re: Medievalist ligature character in the PUA)"

Previous message: William_J_G Overington: "Is there a Japanese character for the word Unicode? (from Re: Unicode Haiku Contest)"
In reply to: Doug Ewell: "BOCU patent (was: Re: Medievalist ligature character in the PUA)"
Next in thread: Doug Ewell: "Re: BOCU patent (was: Re: Medievalist ligature character in the PUA)"
Reply: Doug Ewell: "Re: BOCU patent (was: Re: Medievalist ligature character in the PUA)"

"Doug Ewell"
> I'm not sure what this means, but all multiple-byte character encodings
> have different ranges for lead bytes and trail bytes. Self-delimiting
> numeric values use a different range for the last byte of the sequence.
> So this idea isn't novel either.

Separate ranges have a benefit: they allow fast text search algorithms to work reliably, because they make it easy to
resynchronise from a random position.
Making the Boyer-Moore search algorithm work with BOCU-1 (or with most BOCU profiles in general) requires many
modifications to track the byte position (relative to the start of the byte sequence) within the search key stream
(this requires much larger tables, at least one lookup table per position, and even that is not enough), and you
cannot read the stream backwards.
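
To illustrate what resynchronisation means when lead and trail bytes use disjoint ranges, here is a minimal sketch.
It uses UTF-8 as the example encoding (my choice for illustration; it is not BOCU-1), where trail bytes always
fall in 0x80..0xBF and lead bytes never do. The function names are hypothetical.

```python
def resync_utf8(data: bytes, pos: int) -> int:
    """Return the first character boundary at or after pos (forward resync)."""
    # Skip trail bytes (0b10xxxxxx); any other byte starts a character.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def prev_boundary_utf8(data: bytes, pos: int) -> int:
    """Return the last character boundary at or before pos (backward resync)."""
    # Walk backwards over trail bytes until a lead byte or ASCII byte is found.
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


text = "naïve café".encode("utf-8")
# Offset 3 lands on the trail byte of "ï"; both directions recover a boundary
# by looking only at the bytes themselves, with no state carried from the start.
start = resync_utf8(text, 3)
print(text[start:].decode("utf-8"))
```

No such local test exists when the meaning of a byte depends on the preceding state of the stream, which is the
difficulty described above for BOCU-1.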

> I'd be surprised to see any real-world text encoded in BOCU-1, not only
> because it's probably the world's only IP-encumbered character encoding,
> but because it has been stigmatized by the HTML 5 Working Draft, which
> actually *forbids* conformant user
> agents from recognizing it (along with CESU-8 and UTF-7 and SCSU).

I did not know that HTML5 *forbade* supporting some MIME-registered charsets.

Do you mean instead that it forbids recognizing them automatically when the charset is unknown (not specified by the
resource server, and not specified with the source link) and must be guessed from the byte content of the stream?

HTML5 has another, possibly more serious problem: it absolutely *requires* the automatic recognition of BOMs
(possibly starting UTF-8, UTF-16 and UTF-32 streams), without even checking whether the resource is actually encoded
as plain text. This works if the resource is actually HTML, or XML/XHTML, or CSS, or JavaScript, or some other
scripting language, or tabulated data used in active components, or JSON-structured data, but it does not correctly
protect binary resources.

I have filed a bug about this problem on the W3C's HTML5 draft specification site, because this draft is now at
the "last call before freeze" step. Unfortunately, it was classified as "low priority". Or maybe it was not
completely understood. I suggested restricting the automatic handling of BOMs to resources whose MIME type
starts with "text/", or to a restricted list of other MIME types (that could be extended by browsers) such as
"application/xml" or "image/svg+xml", whose internal representation is also plain text (most of them are actually
XML-based, so the autodetection of BOMs would not be part of the HTML engine, but part of the XML parser used to
parse this MIME type).
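
The restriction suggested above could be sketched roughly as follows. This is only an illustration under stated
assumptions, not text from the HTML5 draft or from the bug report: the names `TEXTUAL_TYPES`, `is_textual` and
`sniff_bom` are hypothetical, and the allowlist holds just the two example types (using the registered SVG type
"image/svg+xml").

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical allowlist of non-"text/" types whose content is still plain text.
TEXTUAL_TYPES = {"application/xml", "image/svg+xml"}


def is_textual(mime_type: str) -> bool:
    """True if BOM sniffing should be attempted for this MIME type."""
    # Drop parameters such as "; charset=..." and normalise case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    return essence.startswith("text/") or essence in TEXTUAL_TYPES


def sniff_bom(data: bytes, mime_type: str):
    """Return an encoding name if a BOM is present and the type is textual, else None."""
    if not is_textual(mime_type):
        return None  # binary resources are never reinterpreted because of their first bytes
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

With this sketch, a binary resource that happens to start with the bytes FF FE is left alone, while the same
bytes at the start of a "text/html" resource select UTF-16LE.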


This archive was generated by hypermail 2.1.5 : Fri Dec 18 2009 - 07:28:45 CST