Recommendations for Unicode auto-detection

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Mon Oct 04 2010 - 22:47:29 CDT


    Hi,

      As I understand it, the Unicode Standard permits interpreting a
    leading U+FEFF as a Unicode signature, and in some encodings also as a
    byte order mark, when there is no higher-level encoding information.
    This holds independently of the particular encoding chosen, so there
    are signatures for UTF-8, UTF-7, and so on. However, there are few or
    no recommendations on when protocols should allow signatures, or on
    which of the many signatures should be recognized when performing
    auto-detection. Furthermore, the signatures are ambiguous: the UTF-32
    LE signature, for example, begins with the same two bytes as the
    UTF-16 LE signature.
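
    To illustrate the ambiguity, here is a minimal Python sketch of
    signature sniffing. The function name sniff_signature and the
    allow_utf32 switch are hypothetical, made up for this example, and
    the UTF-7 signature is left out for brevity. It only shows that the
    same bytes yield two different answers depending on whether UTF-32
    is considered at all.

```python
import codecs

# Signatures are the encoded forms of U+FEFF. Order matters: the
# UTF-32 LE signature starts with the UTF-16 LE signature, so the
# longer one has to be tested first.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data, allow_utf32=True):
    """Return (encoding, signature_length), or (None, 0) if none matches.

    allow_utf32 is a hypothetical switch that models a format which
    does not recognize UTF-32 signatures.
    """
    for bom, name in SIGNATURES:
        if not allow_utf32 and name.startswith("utf-32"):
            continue
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# A document that starts with U+FEFF, encoded as UTF-32 LE.
data = "\ufeff<a/>".encode("utf-32-le")

print(sniff_signature(data))                     # UTF-32 aware consumer
print(sniff_signature(data, allow_utf32=False))  # UTF-32 unaware consumer
```

    The first call reports UTF-32 LE with a four-byte signature, the
    second reports UTF-16 LE with a two-byte signature, so the two
    consumers disagree about the very same document.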

    This has led to a situation where protocols vary considerably, which
    causes interoperability failures and potential security problems. For
    instance, it is common for XML processors to support UTF-32 and detect
    it properly, while other formats, like "HTML5", require treating
    documents with a UTF-32 LE signature as UTF-16 LE. Yet other formats,
    like JSON, are textual in nature and permit only various Unicode
    encodings, but do not permit the BOM.

    In the case of JSON, the problem is amplified by a primary consumer,
    the XMLHttpRequest interface, which always checks for a signature
    whether the format allows it or not. Your JSON content therefore works
    in the browser when using that interface, but may not work elsewhere.
    Furthermore, XMLHttpRequest does not check for UTF-32, with or without
    a signature, while the JSON specification suggests auto-detecting it
    by relying on JSON texts starting with ASCII code points. This leads
    to another interoperability problem.
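
    The kind of auto-detection the JSON specification suggests can be
    sketched as follows, going by the pattern of null octets described
    in RFC 4627, section 3. The function name guess_json_encoding is
    hypothetical, and the fallback for inputs shorter than four octets
    is my own assumption, not something the specification spells out.

```python
def guess_json_encoding(data):
    """Guess the encoding from the nulls in the first four octets.

    This relies on a JSON text starting with two ASCII characters,
    so a leading signature (BOM) is not accounted for here.
    """
    if len(data) < 4:
        # Assumption: too short to tell, fall back to UTF-8.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (xx xx xx xx in the RFC's table) is taken as UTF-8.
    return patterns.get(nulls, "utf-8")


text = '{"a": 1}'
for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(encoding, "->", guess_json_encoding(text.encode(encoding)))
```

    A consumer that sniffs only for signatures and a consumer that only
    looks at the null pattern can disagree: in this sketch, a UTF-16 LE
    text with a leading U+FEFF no longer matches any of the null
    patterns and falls through to UTF-8.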

    Is there some guidance in the Unicode standard that I've missed, or is
    there some guidance that could be offered to authors of new protocols,
    or those revising existing protocols, to ease the pain?

    regards,

    -- 
    Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
    Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
    

