Re: UTF-8 text files

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Mon Jun 06 2005 - 03:54:53 CDT


    On Saturday, June 4th, 2005 21:29Z Doug Ewell wrote:

    > Lasse Kärkkäinen / Tronic [email deleted] wrote:
    >
    >> In practice the autodetection by malformed UTF-8 seems to work
    >> quite reliably, and it very rarely misdetects legacy 8-bit text
    >> as UTF-8 (in fact, I have never seen this happen).
    >
    > It's a contrived example, but the string "NESTLÉ™" encoded in Latin-1

    It is a minor nit, but ™ (U+2122) does not appear in my Latin-1 (ISO/IEC
    8859-1:1998) charts; of course, the character does appear at position 9/9
    (byte 0x99) in the Windows 1250, 1252, 1254, 1257 and 1258 codepages (and
    also in some others, but those others do not have É at 12/9, i.e. 0xC9).

    > consists of the bytes 4E 45 53 54 4C C9 99. This is a valid UTF-8
    > string, and SC UniPad detects it as such and renders it as "NESTLə".
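
    For readers who want to verify this, here is a minimal sketch (in
    Python, my own illustration, not anything from the thread) showing that
    the same seven bytes decode cleanly as UTF-8, and as the intended string
    only under Windows-1252:

        # The Windows-1252 encoding of "NESTLÉ™" (0xC9 = É, 0x99 = ™).
        data = bytes([0x4E, 0x45, 0x53, 0x54, 0x4C, 0xC9, 0x99])

        # 0xC9 0x99 is also a well-formed two-byte UTF-8 sequence,
        # encoding U+0259 LATIN SMALL LETTER SCHWA.
        print(data.decode('utf-8'))    # -> NESTLə
        print(data.decode('cp1252'))   # -> NESTLÉ™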

    Also, I understand that Lasse's argument was that a text file which shows
    /zero/ malformations while decoding as UTF-8 is likely to be in that
    encoding. Examples like NESTLÉ™ can occur inside otherwise purely English
    texts (that is, texts without any other accented characters), but I will
    only point out that such examples ought to be quite rare (yes, I noticed
    Doug wrote "contrived" above).
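
    As a rough illustration of that heuristic, here is a sketch in Python
    (the function name is mine, purely for illustration):

        def looks_like_utf8(data: bytes) -> bool:
            """Accept the data as UTF-8 only if the whole byte
            sequence decodes without a single malformation."""
            try:
                data.decode('utf-8')
                return True
            except UnicodeDecodeError:
                return False

    Note why false positives are rare: in typical Latin-1 text, an accented
    letter (a byte in the range 0xC0-0xFF) is usually followed by a plain
    ASCII letter, which is not a valid UTF-8 continuation byte (0x80-0xBF),
    so the check fails almost immediately; a false positive needs carefully
    chosen byte pairs like Doug's.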

    OTOH, I do not have a very clear idea of the overhead of the full search
    for malformed sequences over the whole file when the encoding is otherwise
    unknown (particularly since such an algorithm would be applied in
    environments where UTF-8 is the most likely encoding, so it means a
    complete scan of every "good" file but only a short scan of a "bad" one.)
    I only know this is not likely to work for a pipe-oriented program, like
    the traditional Unix tools, which must start producing output before the
    whole input has been seen.
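
    For what it is worth, the scan itself can be done incrementally; a
    sketch of checking a pipe chunk by chunk (again Python, my own
    illustration) might look like this:

        import codecs
        import sys

        # A stateful decoder correctly handles a multi-byte sequence
        # that is split across two chunks read from the pipe.
        dec = codecs.getincrementaldecoder('utf-8')()

        try:
            for chunk in iter(lambda: sys.stdin.buffer.read(4096), b''):
                dec.decode(chunk)        # raises on the first malformation
            dec.decode(b'', final=True)  # flush: catches a truncated tail
            print('input decoded cleanly as UTF-8')
        except UnicodeDecodeError as exc:
            print('malformation found:', exc)

    But this only restates the difficulty: a filter in a pipeline would have
    to buffer everything it reads until the verdict arrives before it could
    safely emit any output in the chosen encoding.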

    Antoine


