Re: HTML5 encodings (was: Re: BOCU patent)

From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Dec 28 2009 - 01:42:30 CST


    "Doug Ewell" wrote:
    > Addressing only this one statement for now...
    >
    > "verdy_p" wrote:
    >
    > > Even UTF-32 does not even needs any BOM (because it is self-ordered by
    > > the position of the NUL byte).
    >
    > This fails for any byte sequence { 00, xx, yy, 00 } where xx and yy are
    > both < 0x11. For example:
    >
    > Ā U+0100 LATIN CAPITAL LETTER A WITH MACRON
    > in UTF-32BE: { 00 00 01 00 }
    > in UTF-32LE: { 00 01 00 00 }
    >
    > 𐀀 U+10000 LINEAR B SYLLABLE B008 A
    > in UTF-32BE: { 00 01 00 00 }
    > in UTF-32LE: { 00 00 01 00 }
    >
    > Naturally you wouldn't have a whole string of these in real life, so the
    > heuristic would work. But that's what the BOM is for, so that you don't
    > have to rely on heuristics.

    Correct, but a more realistic example would use ideographic code points in plane 2 (U+20000 and up), which are
    easily confused with Latin code points of the BMP at U+0200: the bytes { 00 02 00 00 } are U+20000 in UTF-32BE and
    U+0200 in UTF-32LE. In a real text you will have many more characters to tell the two byte orders apart, because
    texts containing only U+xxyy00 code points, where xx and yy are both lower than 0x11, are rare, and the two
    readings will show very different character distributions. This case is degenerate and will only occur in very
    short, unrealistic texts.
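
    For illustration, the validity part of such a heuristic can be sketched in Python. This is only a sketch under
    stated assumptions: the function name is hypothetical, and it checks nothing more than whether every 4-byte unit
    is a valid Unicode scalar value in each byte order. It does not look at character distributions.

```python
def utf32_plausible_orders(data: bytes) -> list:
    """Return the byte orders ("big", "little") under which every 4-byte
    unit of `data` is a valid Unicode scalar value.

    Hypothetical helper for illustration; it ignores any BOM.
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes long")

    plausible = []
    for order in ("big", "little"):
        valid = True
        for i in range(0, len(data), 4):
            cp = int.from_bytes(data[i:i + 4], order)
            # Reject values beyond U+10FFFF and surrogate code points.
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                valid = False
                break
        if valid:
            plausible.append(order)
    return plausible


# U+0100 alone in UTF-32BE: also readable as U+10000 in UTF-32LE.
ambiguous = utf32_plausible_orders(bytes([0x00, 0x00, 0x01, 0x00]))

# "A" followed by U+20000 in UTF-32BE: the "A" unit is invalid as little-endian.
resolved = utf32_plausible_orders(
    bytes([0x00, 0x00, 0x00, 0x41, 0x00, 0x02, 0x00, 0x00])
)
```

    Here `ambiguous` contains both orders, while `resolved` contains only "big": a single ordinary BMP character is
    enough to rule out the wrong order.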

    But anyway, isn't there a default byte order in UTF-32 when no BOM is present? Why does HTML5 want to change the
    default order and still keep the name "UTF-32", in contradiction with TUS? Shouldn't HTML5 rename its modified
    encoding to "HTML5-UTF-32" (even if it then requires the BOM, which was also proposed and which also contradicts
    TUS, which makes the BOM optional in UTF-32 and forbids any BOM in UTF-32BE and UTF-32LE)?
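
    For illustration, BOM handling for the "UTF-32" encoding scheme can be sketched as below. It assumes the reading
    of TUS in which unmarked UTF-32 data is taken as big-endian when no higher-level protocol says otherwise. The
    function name is hypothetical; the BOM constants come from Python's standard codecs module.

```python
import codecs


def decode_utf32_scheme(data: bytes) -> str:
    """Decode bytes labeled "UTF-32": honor a leading BOM if present,
    otherwise assume big-endian (assumption: the TUS default for
    unmarked UTF-32). Hypothetical helper for illustration.
    """
    if data.startswith(codecs.BOM_UTF32_BE):
        return data[4:].decode("utf-32-be")
    if data.startswith(codecs.BOM_UTF32_LE):
        return data[4:].decode("utf-32-le")
    # No BOM: fall back to the assumed big-endian default.
    return data.decode("utf-32-be")


# The same four payload bytes, with and without a little-endian BOM.
payload = bytes([0x00, 0x01, 0x00, 0x00])
unmarked = decode_utf32_scheme(payload)
marked_le = decode_utf32_scheme(codecs.BOM_UTF32_LE + payload)
```

    With these inputs, `unmarked` is U+10000 and `marked_le` is U+0100, matching the byte sequences quoted above.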

    Hmmm... I can't say that a new registration is required yet, as the HTML5 draft has not been finalized, but if it
    departs from the TUS rules, it really MUST register a new MIME charset under its own name in the IANA registry!

    Philippe.


