RE: Detecting encoding in Plain text

Date: Mon Jan 12 2004 - 06:55:29 EST

    Quoting Marco Cimarosti:

    > Doug Ewell wrote:
    > > In UTF-16 practically any sequence of bytes is valid, and since you
    > > can't assume you know the language, you can't employ distribution
    > > statistics. Twelve years ago, when most text was not Unicode and all
    > > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
    > > of checking every other byte to see if it was zero, which of course
    > > would only work for Latin-1 text encoded in UTF-16.
    > I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting
    > BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that
    > this method was suggested first by Microsoft: to me, it seems quite
    > self-evident.
    > It is extremely unlikely that a text file encoded in any single- or
    > multi-byte encoding (including UTF-8) would contain a zero byte, so the
    > presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
    > UTF-32.

    False positives can be caused by U+0000 (most often encoded as the octet
    0x00), which some applications do use in text files. Hence you need to look
    for sequences in which every other octet is null, which in turn increases
    the risk of false negatives:

    False negatives can be caused by text that doesn't contain any Latin-1
    characters.
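
    As a rough illustration of the every-other-octet check, here is a minimal
    Python sketch. The function name guess_utf16_by_nulls and the 0.4 threshold
    are my own assumptions, not anything from the Microsoft documentation, and
    it handles UTF-16 only (UTF-32 would need a similar test on groups of four
    octets).

```python
def guess_utf16_by_nulls(data: bytes, threshold: float = 0.4):
    """Guess the byte order of BOM-less UTF-16 from where zero octets fall.

    Hypothetical helper: returns "utf-16-le", "utf-16-be", or None when the
    zero octets give no clear signal. The threshold is an arbitrary choice.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None

    # Count zero octets at even and at odd offsets separately.
    even_nulls = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    odd_nulls = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    even_ratio = even_nulls / pairs
    odd_ratio = odd_nulls / pairs

    # Big-endian puts the (zero) high octet first, i.e. at even offsets;
    # little-endian puts it second, i.e. at odd offsets. The other position
    # must be nearly free of zeros, so a stray U+0000 does not trigger a match.
    if even_ratio >= threshold and odd_ratio < threshold / 4:
        return "utf-16-be"
    if odd_ratio >= threshold and even_ratio < threshold / 4:
        return "utf-16-le"
    return None
```

    With this sketch, "Hello, world" encoded as UTF-16LE is reported as
    "utf-16-le", while the same check returns None for Greek text in UTF-16
    (no zero octets at all, a false negative) and for an ASCII file that merely
    contains a couple of 0x00 separators.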

    The method can be used reliably with text files that are guaranteed to contain
    large amounts of Latin-1 - in particular files for which certain ASCII
    characters are given an application-specific meaning; for instance XML and HTML
    files, comma-delimited files, tab-delimited files, vCards and so on. It can be
    particularly reliable in cases where certain ASCII characters will always begin
    the document (e.g. XML).
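
    For the XML case the check can be reduced to comparing the first four
    octets against the known encodings of "<?xm". The sketch below assumes the
    document starts with an XML declaration and has no BOM; the names
    XML_SIGNATURES and sniff_xml_encoding are hypothetical, and the label
    "ascii-compatible" only means that the declaration itself must then be
    read to find the actual encoding.

```python
# First four octets of "<?xml" in each encoding family (no BOM).
XML_SIGNATURES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii-compatible"),  # UTF-8, ISO 8859-x, and similar
]


def sniff_xml_encoding(data: bytes):
    """Return an encoding family guessed from the leading octets, or None."""
    for prefix, name in XML_SIGNATURES:
        if data.startswith(prefix):
            return name
    return None
```

    Because the leading characters are fixed ASCII, this works even when the
    rest of the document contains no Latin-1 text at all.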

    Jon Hanna
    *Thought provoking quote goes here*
