Re: Recommendations for Unicode auto-detection

From: Doug Ewell ([email protected])
Date: Wed Oct 06 2010 - 12:58:28 CDT

Next message: Mark Davis ☕: "Fwd: some Unicode 6.0 symbols"

Previous message: Mark Davis ☕: "Re: Unicode denormalizer"
Maybe in reply to: Bjoern Hoehrmann: "Recommendations for Unicode auto-detection"

Bjoern Hoehrmann <derhoermi at gmx dot net> wrote:

> ... However, there are no or insufficient recommendations when
> protocols should allow [U+FEFF signatures], and which of the many
> signatures should be recognized when performing auto-detection.

I assume you have read http://unicode.org/faq/utf_bom.html#BOM .

Increasingly, protocols tend to discourage or forbid the use of U+FEFF
signatures, either to achieve poor-man's compatibility with 8-bit legacy
applications (like shell scripts), or out of fears that two encoding
declarations in the same document (e.g. U+FEFF signature plus XML
"encoding") might disagree.
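
To make the second fear concrete, here is a rough sketch of a check for a signature and an XML "encoding" declaration that contradict each other. The function name, the 200-byte window, and the loose name comparison are my own assumptions for illustration; a real XML processor follows the detection rules in the XML specification and a proper charset alias table.

```python
import codecs
import re

# Signatures this sketch knows about (UTF-32 is left out for brevity).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the start.
DECL = re.compile(
    r'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declarations_agree(data: bytes):
    """True/False if both a signature and a declaration exist, else None."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            break
    else:
        return None  # no signature, nothing to compare
    # Decode a small window after the signature using the signalled encoding.
    head = data[len(bom):len(bom) + 200].decode(codec, errors="replace")
    m = DECL.match(head)
    if m is None:
        return None  # no encoding declaration
    declared = m.group(1).lower()
    # "utf-16" is accepted for either byte order (an assumption of this sketch).
    family = "utf-16" if codec.startswith("utf-16") else codec
    return declared in (codec, family)
```

A UTF-8 signature followed by encoding="ISO-8859-1" yields False here, which is exactly the disagreement those protocols are worried about.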

This type of objection to in-band tagging mechanisms tends to assume
that all worthwhile data is in a high-level markup format, or that
processing these sequences is too difficult for 21st-century software.

> Furthermore, the signatures are ambiguous.

The only ambiguity I can think of is where "little-endian UTF-16 BOM
followed by U+0000" can be confused with "little-endian UTF-32 BOM."
Most text strings do not begin with U+0000, so even this case is more of
a theoretical problem than a real one.
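
As a minimal illustration (the table and the function name sniff_bom are mine, not from any specification), a signature sniffer that tests the longest signatures first shows where the one ambiguity sits: the bytes FF FE 00 00 are reported as UTF-32LE even if the sender meant "UTF-16LE BOM followed by U+0000."

```python
import codecs

# Longest signatures first, so that FF FE 00 00 is tested as UTF-32LE
# before its two-byte prefix FF FE is tested as UTF-16LE.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return (encoding name, signature length), or (None, 0) if none found."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Ordinary UTF-16LE text is detected as expected.
ordinary = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

# UTF-16LE text that begins with U+0000 is indistinguishable from UTF-32LE.
ambiguous = codecs.BOM_UTF16_LE + "\x00A".encode("utf-16-le")
```

Here sniff_bom(ordinary) gives ("utf-16-le", 2), while sniff_bom(ambiguous) gives ("utf-32-le", 4). Testing the shorter signature first would instead misread every genuine UTF-32LE stream, so longest-first seems the lesser evil.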

There are several possible byte sequences for the UTF-7 signature, but
this is more of an inconvenience than an ambiguity. UTF-7 signatures
tend to appear more in comprehensive tables of signatures than in actual
content.
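
For reference, the variants usually listed are the bytes "+/v" followed by one of "8", "9", "+" or "/"; those byte values come from the commonly published signature tables, not from anything stated above. The fourth byte varies because modified Base64 packs the last four bits of U+FEFF together with the first two bits of whatever follows. The sketch below (helper names are hypothetical) recognizes the four forms and derives the fourth byte for a following BMP character that would be Base64-encoded in the same run.

```python
import base64

UTF7_SIGNATURE_PREFIX = b"+/v"
UTF7_FOURTH_BYTES = b"89+/"

def has_utf7_signature(data: bytes) -> bool:
    """True if data starts with one of the four commonly listed forms."""
    return (
        len(data) >= 4
        and data.startswith(UTF7_SIGNATURE_PREFIX)
        and data[3] in UTF7_FOURTH_BYTES
    )

def fourth_byte_for(next_char: str) -> bytes:
    """Base64 character holding the low 4 bits of U+FEFF plus the top
    2 bits of the next UTF-16 code unit (next_char must be in the BMP)."""
    units = ("\ufeff" + next_char).encode("utf-16-be")
    return base64.b64encode(units)[2:3]

# One sample character from each quarter of the BMP.
variants = [fourth_byte_for(c) for c in ("\u0416", "\u4e2d", "\u9fa0", "\ud55c")]
```

Since all four forms share the three-byte prefix "+/v", a detector only has to look at one extra byte, which is why this is an inconvenience rather than an ambiguity.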

> This has led to a situation where protocols vary considerably leading
> to interoperability failures and potential security problems. For
> instance, it is common for XML processors to support UTF-32 and detect
> it properly, while other formats, like "HTML5" require treating
> documents with a UTF-32 LE signature as UTF-16 LE. Yet other formats,
> like JSON, are textual in nature and permit only various Unicode
> encodings, but do not permit the BOM.

HTML5, at least, deliberately forbids the use of certain encodings (like
SCSU) and auto-detection of others (like UTF-32), not only to prevent
cross-site scripting attacks, but out of a belief that supporting them
"just wastes developer time." See
http://lists.w3.org/Archives/Public/public-html-comments/2008Jan/0032.html
to see this viewpoint expressed by an HTML Working Group participant.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s

