Re: FYI: Google blog on Unicode

From: verdy_p (
Date: Mon Feb 08 2010 - 23:41:03 CST


    "Doug Ewell" wrote:
    > What about option 1½: Use charset detection, assisted by the charset
    > tagging. That is, if the content is valid UTF-8 or UTF-16, or something
    > else unambiguous like GB18030, ignore the tagging and trust the
    > detection algorithm fully. But if the algorithm shows that it could
    > reasonably be any of 8859-1 or -2 or -15, and it is tagged as 8859-2,
    > trust the tag. Just a thought.

    One common cause of unreliable identification of ISO 8859-1 or -2 is that they are frequently replaced by their
    Windows 1252 or 1250 "extensions".

    Since those common replacements are now approved in HTML5, the detector should accept these equivalences. Who
    actually uses the C1 controls in HTML? Only one C1 control is standard in HTML 3/4, and it has been a very long
    time since I last saw a page using it for newlines. Apparently it occurred only on pages from IBM systems, through
    an automatic conversion from some EBCDIC variant, but even those systems now support ISO 8859 charsets by ignoring
    the differences between newlines, so they accept CR/LF equally.

    If the algorithm treats the ISO 8859-x tag as unreliable because the page contains some Windows 125x characters
    (in the code range 0x80-0x9F), it is probably wrong: assume the corresponding Windows 125x charset instead, and
    use that as the secondary indicator (after the statistical estimation heuristic).
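
    A minimal sketch of that fallback, combined with the "option 1½" idea quoted above. The function name, the
    mapping table, and the default label are hypothetical, and the statistical heuristic is left out entirely, so
    this only illustrates the ordering of the checks, not any real detector:

```python
import re
from typing import Optional

# Bytes 0x80-0x9F are C1 controls in ISO 8859-x but mostly printable in Windows 125x.
C1_RANGE = re.compile(rb"[\x80-\x9f]")

# Hypothetical table: ISO 8859 label -> Windows counterpart.
# Note: Windows-1250 is not byte-compatible with ISO 8859-2 above 0xA0,
# so this swap is cruder for -2 than for -1.
WINDOWS_FALLBACK = {
    "iso-8859-1": "windows-1252",
    "iso-8859-2": "windows-1250",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def choose_charset(data: bytes, tagged: Optional[str]) -> str:
    """Pick a charset label from the content first, then from the tag."""
    has_high_bytes = any(b >= 0x80 for b in data)
    # Unambiguous case: non-ASCII content that is valid UTF-8 overrides the tag.
    if has_high_bytes and is_valid_utf8(data):
        return "utf-8"
    # Assumed default when no tag is present.
    label = (tagged or "iso-8859-1").lower()
    # C1-range bytes under an ISO 8859-x tag: assume the Windows counterpart.
    if C1_RANGE.search(data):
        return WINDOWS_FALLBACK.get(label, label)
    # Otherwise the content is ambiguous among 8-bit charsets: trust the tag.
    return label
```

    For example, a page tagged ISO 8859-1 that contains the bytes 0x93 and 0x94 (curly quotes in Windows 1252)
    would be reported as windows-1252 rather than rejected.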

    Some characters are also good indicators that an ISO 8859-x (or Windows 125x) charset is present. The most
    frequent is NBSP (U+00A0), which is increasingly present in a great many pages (notably within empty table cells
    used for page layout). Its presence readily distinguishes 8-bit charsets (ISO 8859-x, Windows 125x) from UTFs and
    from other reliable encodings like GB18030 in China, the JIS variants in Japan, and the KSC variants in South
    Korea.

    But other indicators are also important: statistics based on isolated characters alone will not be reliable
    enough. For example, the detection of NBSP is reliable within specific contexts, such as after the ">" ending an
    HTML tag, between a letter and some common punctuation mark, or between digits.
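
    As a rough illustration of such a context test, the sketch below counts a raw 0xA0 byte only when it appears in
    one of the three contexts just named. The pattern, the punctuation set, and the function name are assumptions
    made for the example; a real detector would weigh these counts against other evidence:

```python
import re

# A raw 0xA0 byte is NBSP in ISO 8859-x and Windows 125x.
# It is counted only in the contexts named in the text above.
NBSP_CONTEXTS = re.compile(
    rb">\xa0"                   # right after the ">" that ends a tag
    rb"|[0-9]\xa0[0-9]"         # between two digits
    rb"|[A-Za-z]\xa0[:;!?%]"    # between a letter and punctuation (assumed set)
)


def count_nbsp_contexts(data: bytes) -> int:
    """Count 0xA0 bytes that sit in a context typical of an 8-bit NBSP."""
    return len(NBSP_CONTEXTS.findall(data))
```

    In UTF-8 the same character is the two bytes 0xC2 0xA0, so the byte before 0xA0 is never ">", a digit, or an
    ASCII letter, and none of these patterns match.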

    Does Google use such context-based heuristics to improve the detector? That is, does it look for ordered pairs or
    triplets of bytes, and does it adjust its statistical thresholds based on the document's exposed MIME type (HTML,
    CSS, JavaScript, or plain text), which should be a reliable indicator that can always be trusted?

    XML is normally not ambiguous (its autodetection algorithm is fully specified for US-ASCII and UTFs only) and
    should not even require a custom detector. But this may be wrong in practice, notably for various syndicated RSS
    feeds built by poorly configured PHP-based sites that don't actually use any XML-based DOM, but only concatenate
    various strings into something that mostly looks like valid XML with the correct schema. Google probably has
    statistics about such errors, and I don't know how RSS readers cope with them; maybe there are Windows 125x
    exceptions there too.
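
    For reference, a simplified sketch of the kind of sniffing XML allows: look for a byte order mark, then for the
    encoding pseudo-attribute in an ASCII-compatible XML declaration, and otherwise fall back to UTF-8. The function
    name is hypothetical, and the sketch deliberately skips cases such as UTF-16 without a BOM. A hand-concatenated
    feed can pass this step and still contain bytes that are invalid in the declared encoding:

```python
import codecs
import re

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very start.
XML_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    match = XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: XML defaults to UTF-8.
    return "utf-8"
```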

    What about more specific encodings used through AJAX requests (for example JSON-formatted data, commonly used
    instead of XML)? Is there a charset detector used by those requests performed in Chrome or Chromium, does it use
    specific heuristics, and is there a way to disable it completely and force it to use the indicated charset, or
    to return a decoding error if that charset was wrong?
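
    To make the last option concrete, here is what "no detection, strict decoding" could look like on the receiving
    side. This is only a sketch of the behaviour being asked about, with a hypothetical function name; it says
    nothing about what Chrome or Chromium actually do:

```python
import json


def decode_json_strict(body: bytes, charset: str):
    """Decode a JSON body with exactly the indicated charset, with no guessing.

    Raises LookupError for an unknown charset label and UnicodeDecodeError
    when the bytes are not valid in that charset.
    """
    text = body.decode(charset, errors="strict")
    return json.loads(text)
```

    A body sent as ISO 8859-1 but labelled UTF-8 would then fail loudly at the decoding step instead of being
    silently reinterpreted.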

    What if the MIME type is also wrong or unknown (or is an unknown alias)? Will Chrome handle it as if it were plain
    text (for example if the exposed MIME type still matches "text/*", but not "image/*" or "application/*")? Is there
    a MIME type detector in that case?

    This archive was generated by hypermail 2.1.5 : Mon Feb 08 2010 - 23:44:11 CST