Re: Recommendations for Unicode auto-detection

From: Doug Ewell (doug@ewellic.org)
Date: Wed Oct 06 2010 - 12:58:28 CDT


    Bjoern Hoehrmann <derhoermi at gmx dot net> wrote:

    > ... However, there are no or insufficient recommendations when
    > protocols should allow [U+FEFF signatures], and which of the many
    > signatures should be recognized when performing auto-detection.

    I assume you have read http://unicode.org/faq/utf_bom.html#BOM .

    Increasingly, protocols tend to discourage or forbid the use of U+FEFF
    signatures, either to achieve poor-man's compatibility with 8-bit legacy
    applications (like shell scripts), or out of fears that two encoding
    declarations in the same document (e.g. U+FEFF signature plus XML
    "encoding") might disagree.

    This type of objection to in-band tagging mechanisms tends to assume
    that all worthwhile data is in a high-level markup format, or that
    processing these sequences is too difficult for 21st-century software.

    > Furthermore, the signatures are ambiguous.

    The only ambiguity I can think of is where "little-endian UTF-16 BOM
    followed by U+0000" can be confused with "little-endian UTF-32 BOM."
    Most text strings do not begin with U+0000, so even this case is more of
    a theoretical problem than a real one.
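
    To make the overlap concrete, here is a minimal Python sketch of a
    signature sniffer. The function name sniff_signature and the
    longest-match-first ordering are illustrative assumptions, not
    something prescribed in this thread or by the Unicode FAQ; the BOM
    constants come from Python's standard codecs module.

```python
import codecs

# Hypothetical lookup table: longer signatures are listed first, so that
# FF FE 00 00 (UTF-32 LE) is tested before its prefix FF FE (UTF-16 LE).
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes):
    """Return (encoding name, signature length), or (None, 0) if unsigned."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# UTF-16 LE text whose first character is U+0000: the bytes are identical
# to the UTF-32 LE signature, so the sniffer cannot tell the two apart.
ambiguous = codecs.BOM_UTF16_LE + "\x00".encode("utf-16-le")

# Ordinary UTF-16 LE text is detected without any ambiguity.
ordinary = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
```

    With this ordering, the ambiguous input is reported as UTF-32 LE and
    the ordinary input as UTF-16 LE. Reversing the order would instead
    misread every signed UTF-32 LE stream as UTF-16 LE.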

    There are several possible byte sequences for the UTF-7 signature, but
    this is more of an inconvenience than an ambiguity. UTF-7 signatures
    tend to appear more in comprehensive tables of signatures than in actual
    content.
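
    The reason for the several byte sequences can be sketched in a few
    lines of Python. In UTF-7, U+FEFF is written as modified base64 over
    UTF-16BE code units, and the last base64 character of the signature
    also carries the top two bits of whatever code unit follows. The
    helper names below are hypothetical, and the sketch assumes the
    signature and the next character share one base64 run.

```python
import base64


def utf7_signature_prefix(next_code_unit: int) -> bytes:
    """First four bytes of UTF-7 text that starts with U+FEFF.

    next_code_unit is the UTF-16 code unit following the signature inside
    the same base64 run; only its top two bits affect the result.
    """
    raw = b"\xfe\xff" + next_code_unit.to_bytes(2, "big")
    # '+' opens the base64 run; the first three base64 characters cover
    # the 16 bits of U+FEFF plus two bits of the next code unit.
    return b"+" + base64.b64encode(raw)[:3]


# One sample code unit for each possible value of the top two bits.
UTF7_SIGNATURES = sorted(
    {utf7_signature_prefix(u) for u in (0x0041, 0x4E2D, 0x8000, 0xC000)}
)


def looks_like_utf7_signature(data: bytes) -> bool:
    """True if data starts with any of the four signature variants."""
    return data[:4] in UTF7_SIGNATURES
```

    A detector that wants to recognize UTF-7 therefore has to compare
    against four prefixes rather than one, which is the inconvenience
    described above.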

    > This has led to a situation where protocols vary considerably, leading
    > to interoperability failures and potential security problems. For
    > instance, it is common for XML processors to support UTF-32 and detect
    > it properly, while other formats, like "HTML5" require treating
    > documents with a UTF-32 LE signature as UTF-16 LE. Yet other formats,
    > like JSON, are textual in nature and permit only various Unicode
    > encodings, but do not permit the BOM.

    HTML5, at least, deliberately forbids the use of certain encodings (like
    SCSU) and auto-detection of others (like UTF-32), not only to prevent
    cross-site scripting attacks, but out of a belief that supporting them
    "just wastes developer time." See
    http://lists.w3.org/Archives/Public/public-html-comments/2008Jan/0032.html
    to see this viewpoint expressed by an HTML Working Group participant.

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

