Re: HTML5 encodings (was: Re: BOCU patent)

From: Andrew West (andrewcwest@gmail.com)
Date: Mon Dec 28 2009 - 05:33:33 CST

    2009/12/28 Doug Ewell <doug@ewellic.org>:
    >
    > Ā U+0100 LATIN CAPITAL LETTER A WITH MACRON
    > in UTF-32BE: { 00 00 01 00 }
    > in UTF-32LE: { 00 01 00 00 }
    >
    > 𐀀 U+10000 LINEAR B SYLLABLE B008 A
    > in UTF-32BE: { 00 01 00 00 }
    > in UTF-32LE: { 00 00 01 00 }
    >
    > Naturally you wouldn't have a whole string of these in real life, so the
    > heuristic would work.

    You can't make that assumption. Linear B users are much more likely
    than other users to use UTF-32, so a string of the byte sequences
    above may well be a string of LINEAR B SYLLABLE B008 A characters,
    even though that character is far rarer than LATIN CAPITAL LETTER A
    WITH MACRON. So I can't see how the heuristic could tell whether the
    text was big-endian or little-endian in this case.

    I've just tested the scenario with BabelPad. It autodetects a string
    of U+0100 characters saved as UTF-32LE with no BOM as UTF-32BE (i.e.
    as a string of U+10000 characters), and it autodetects a string of
    U+10000 characters saved as UTF-32LE with no BOM as UTF-32BE (i.e. as
    a string of U+0100 characters). Has the heuristic failed? Probably,
    because on Windows, all things being equal, little-endian should be
    assumed rather than big-endian. (Of course, once you add a CR/LF to
    the file, the heuristic correctly autodetects both files as
    UTF-32LE.)
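
    The CR/LF trick works because an ASCII line break only decodes to a
    valid code point under one of the two byte orders. Here is a rough
    Python sketch of that kind of heuristic (an assumption about how such
    detection might work, not BabelPad's actual implementation):

        # A rough sketch of a BOM-less UTF-32 byte-order guess: prefer
        # the byte order under which the data decodes and contains ASCII
        # line breaks; when both orders look equally plausible, fall back
        # to a platform default (little-endian on Windows).
        def guess_utf32_byte_order(data, default='le'):
            def try_decode(order):
                try:
                    return data.decode('utf-32-' + order)
                except UnicodeDecodeError:
                    return None

            be, le = try_decode('be'), try_decode('le')
            if be is None and le is None:
                return None          # not valid UTF-32 in either order
            if be is None:
                return 'le'
            if le is None:
                return 'be'
            # Both orders decode: count tell-tale CR/LF characters.
            be_hits = be.count('\r') + be.count('\n')
            le_hits = le.count('\r') + le.count('\n')
            if be_hits != le_hits:
                return 'be' if be_hits > le_hits else 'le'
            return default           # all things being equal

        # U+0100 repeated, saved little-endian with no BOM: both byte
        # orders decode cleanly, so only the default settles it.
        print(guess_utf32_byte_order(('\u0100' * 4).encode('utf-32-le')))

        # With a CR/LF appended, only the little-endian reading decodes,
        # so the guess no longer depends on the default.
        print(guess_utf32_byte_order(
            ('\u0100' * 4 + '\r\n').encode('utf-32-le')))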

    Andrew


