Re: HTML5 encodings (was: Re: BOCU patent)

From: Andrew West (andrewcwest@gmail.com)
Date: Mon Dec 28 2009 - 05:33:33 CST

    2009/12/28 Doug Ewell <doug@ewellic.org>:
    >
    > Ā U+0100 LATIN CAPITAL LETTER A WITH MACRON
    > in UTF-32BE: { 00 00 01 00 }
    > in UTF-32LE: { 00 01 00 00 }
    >
    > 𐀀 U+10000 LINEAR B SYLLABLE B008 A
    > in UTF-32BE: { 00 01 00 00 }
    > in UTF-32LE: { 00 00 01 00 }
    >
    > Naturally you wouldn't have a whole string of these in real life, so the
    > heuristic would work.

    You can't make that assumption. Linear B users are much more likely
    than other users to use UTF-32, so a string of the byte sequences
    above may well be a string of LINEAR B SYLLABLE B008 A characters,
    even though that character is far rarer than LATIN CAPITAL LETTER A
    WITH MACRON. So I can't see how the heuristic could tell whether the
    text was big-endian or little-endian in this case.

    I've just tested the scenario with BabelPad. It autodetects a string
    of U+0100 characters saved as UTF-32LE with no BOM as UTF-32BE (i.e.
    as a string of U+10000 characters), and it autodetects a string of
    U+10000 characters saved as UTF-32LE with no BOM as UTF-32BE (i.e. as
    a string of U+0100 characters). Has the heuristic failed? Probably,
    because on Windows, all things being equal, little-endian should be
    assumed rather than big-endian. (Of course, once you add a CR/LF to
    the file, the heuristic correctly autodetects both files as
    UTF-32LE.)
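
    The CR/LF trick works because an ASCII line break only decodes to a
    valid code point under one of the two byte orders. Here is a rough
    Python sketch of that kind of heuristic (an assumption about how such
    detection might work, not BabelPad's actual implementation):

        # A rough sketch of a BOM-less UTF-32 byte-order guess: prefer
        # the byte order under which the data decodes and contains ASCII
        # line breaks; when both orders look equally plausible, fall back
        # to a platform default (little-endian on Windows).
        def guess_utf32_byte_order(data, default='le'):
            def try_decode(order):
                try:
                    return data.decode('utf-32-' + order)
                except UnicodeDecodeError:
                    return None

            be, le = try_decode('be'), try_decode('le')
            if be is None and le is None:
                return None          # not valid UTF-32 in either order
            if be is None:
                return 'le'
            if le is None:
                return 'be'
            # Both orders decode: count tell-tale CR/LF characters.
            be_hits = be.count('\r') + be.count('\n')
            le_hits = le.count('\r') + le.count('\n')
            if be_hits != le_hits:
                return 'be' if be_hits > le_hits else 'le'
            return default           # all things being equal

        # U+0100 repeated, saved little-endian with no BOM: both byte
        # orders decode cleanly, so only the default settles it.
        print(guess_utf32_byte_order(('\u0100' * 4).encode('utf-32-le')))

        # With a CR/LF appended, only the little-endian reading decodes,
        # so the guess no longer depends on the default.
        print(guess_utf32_byte_order(
            ('\u0100' * 4 + '\r\n').encode('utf-32-le')))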

    Andrew


