re: BOCU patent (was: Re: Medievalist ligature character in the PUA)

From: verdy_p (
Date: Fri Dec 18 2009 - 07:25:31 CST

    "Doug Ewell"
    > I'm not sure what this means, but all multiple-byte character encodings
    > have different ranges for lead bytes and trail bytes. Self-delimiting
    > numeric values use a different range for the last byte of the sequence.
    > So this idea isn't novel either.

    Separate ranges have a benefit: they let fast text search algorithms work reliably, because a reader can easily
    resynchronise from a random position in the stream.
    Making the Boyer-Moore search algorithm work with BOCU-1 (or with most BOCU profiles in general) requires extensive
    modifications to track the byte position (relative to the start of the byte sequence) within the search key stream.
    This needs much larger tables, at least one lookup table for each position, and even that is not enough. You also
    cannot read the stream in the backward direction.
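
    To illustrate the resynchronisation point, here is a minimal sketch that uses UTF-8 (not BOCU-1) as the example of
    an encoding with separate lead and trail byte ranges. The function names are hypothetical and the input is assumed
    to be valid UTF-8. Because trail bytes always fall in 0x80..0xBF, a reader dropped at any byte offset can step back
    to the start of the character, which also makes backward reading straightforward.

```python
def is_utf8_trail(byte: int) -> bool:
    # UTF-8 trail (continuation) bytes are exactly 0x80..0xBF (bit pattern 10xxxxxx).
    return (byte & 0xC0) == 0x80


def resync_utf8(data: bytes, pos: int) -> int:
    """Return the offset of the first byte of the character containing data[pos].

    Assumes data is valid UTF-8 (hypothetical helper, for illustration only).
    """
    # Step backward over trail bytes until a lead byte or an ASCII byte is found.
    while pos > 0 and is_utf8_trail(data[pos]):
        pos -= 1
    return pos


def chars_backward(data: bytes):
    """Yield the characters of valid UTF-8 data from last to first."""
    end = len(data)
    while end > 0:
        start = resync_utf8(data, end - 1)
        yield data[start:end].decode("utf-8")
        end = start


text = "naïve café".encode("utf-8")
# Offset 3 is the trail byte of "ï"; resynchronising moves back to its lead byte.
print(resync_utf8(text, 3))
print("".join(chars_backward(text)))
```

    A state-dependent encoding offers no such byte-level test, which is why a search that jumps around the stream
    (as Boyer-Moore does) cannot be applied to it unmodified.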

    > I'd be surprised to see any real-world text encoded in BOCU-1, not only
    > because it's probably the world's only IP-encumbered character encoding,
    > but because it has been stigmatized by the HTML 5 Working Draft,
    > which actually *forbids* conformant user
    > agents from recognizing it (along with CESU-8 and UTF-7 and SCSU).

    I did not know that HTML5 *forbade* supporting some MIME-registered charsets.

    Do you mean instead that it forbids recognizing them automatically when the charset is unknown (not specified by the
    resource server, and not specified with the source link) and must be guessed from the byte content of the stream?

    HTML5 has another, possibly more serious problem: it absolutely *requires* the automatic recognition of BOMs
    (which may start UTF-8, UTF-16 and UTF-32 streams), without even checking whether the resource is actually encoded
    as plain text. This works if the resource is in fact HTML, or XML/XHTML, or CSS, or JavaScript, or some other
    scripting language, or tabulated data used in active components, or JSON-structured data, but it does
    not correctly protect binary resources.

    I have filed a bug about this problem on the W3C's HTML5 Draft Specification site, because this draft is now at
    the "last call before freeze" step. Unfortunately, it was classified as "low priority". Or maybe it was not
    completely understood. I suggested restricting the automatic handling of BOMs to resources whose MIME type
    starts with "text/", or to a restricted list of other MIME types (that could be extended by browsers) such as
    "application/xml" or "image/svg+xml", whose internal representation is also plain text (most of them are actually
    XML-based, so the autodetection of BOMs would not be part of the HTML engine, but part of the XML parser
    used to parse this MIME type).
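
    The restriction suggested above could be sketched roughly as follows. This is only an illustration of the idea, not
    text from the HTML5 draft: the names TEXTUAL_TYPES, is_textual and sniff_bom are hypothetical, and the allowlist
    holds just the two example types (SVG under its registered name, "image/svg+xml").

```python
import codecs

# BOMs are checked longest-first, because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical allowlist of non-"text/" types whose representation is plain text.
# A browser could extend this set.
TEXTUAL_TYPES = {"application/xml", "image/svg+xml"}


def is_textual(mime_type: str) -> bool:
    # Drop any parameters (e.g. "; charset=...") before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base.startswith("text/") or base in TEXTUAL_TYPES


def sniff_bom(mime_type: str, data: bytes):
    """Return an encoding name if a BOM is honoured for this MIME type, else None."""
    if not is_textual(mime_type):
        # Binary resource: leading bytes that look like a BOM are just data.
        return None
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("text/html", b"\xef\xbb\xbf<html>"))
print(sniff_bom("image/jpeg", b"\xff\xfe\x00\x10"))
```

    With this check, a binary resource that happens to begin with the bytes FF FE is left alone, while a "text/"
    resource with the same prefix is still decoded as UTF-16.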
