Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jan 12 2004 - 07:14:09 EST


    On 12/01/2004 03:09, Marco Cimarosti wrote:

    > ...
    >
    >It is extremely unlikely that a text file encoded in any single- or
    >multi-byte encoding (including UTF-8) would contain a zero byte, so the
    >presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
    >UTF-32.
    >
    Is it not dangerous to assume that U+0000 is not used? It is a valid
    character and is commonly used, e.g. as a string terminator. Perhaps it
    should not be used in truly plain text, but it is likely to occur in
    files which are basically text but include certain kinds of markup.
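
    To illustrate the concern, here is a minimal Python sketch of the
    zero-byte test as described in the quoted text. The function name and
    the sample record format are hypothetical, chosen only for this
    example; they are not taken from any real detector.

```python
def looks_like_utf16_or_utf32(data: bytes) -> bool:
    """Naive heuristic: any zero byte is taken as a hint of UTF-16/UTF-32."""
    return b"\x00" in data


# Hypothetical "basically text" data that uses U+0000 as a field terminator.
records = "name=Peter\x00org=qaya\x00"
utf8_with_nul = records.encode("utf-8")

# Ordinary text for comparison.
plain_utf8 = "hello".encode("utf-8")
plain_utf16 = "hello".encode("utf-16-le")

print(looks_like_utf16_or_utf32(plain_utf8))     # no zero bytes
print(looks_like_utf16_or_utf32(plain_utf16))    # zero bytes from UTF-16
print(looks_like_utf16_or_utf32(utf8_with_nul))  # zero bytes, yet it is UTF-8
```

    The third case is the problematic one: the data is valid UTF-8, but
    the heuristic reports it as a UTF-16 or UTF-32 candidate.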

    >... This is due to the fact that, in any language, shared characters in
    >the Latin-1 range (controls, space, digits, punctuation, etc.) should be
    >more frequent than occasional code points of form <U+??00>. ...
    >
    This one also looks dangerous. Some scripts include their own digits and
    punctuation; not all scripts use spaces; and controls need not appear at
    all if U+2028 LINE SEPARATOR is used for new lines. Moreover, there may
    be some characters U+??00 which are used rather commonly in a
    particular script and so occur commonly in some text files.
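
    A minimal Python sketch of how such a frequency test can go wrong. It
    assumes the heuristic guesses UTF-16 byte order by comparing zero
    bytes at even and odd offsets; that reading of the quoted text, the
    function name, and the sample string are all assumptions made for
    this example. The sample uses U+3000 IDEOGRAPHIC SPACE as one
    character of the form U+??00, and U+2028 instead of a control
    character for the line break.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess byte order from where the zero bytes fall (hypothetical heuristic).

    Latin-1 range characters are 00 xx in big-endian and xx 00 in
    little-endian, so zeros at even offsets suggest BE and zeros at odd
    offsets suggest LE.
    """
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if even_zeros > odd_zeros:
        return "utf-16-be"
    if odd_zeros > even_zeros:
        return "utf-16-le"
    return "unknown"


# Latin-1 range text: the heuristic works as intended.
latin = "Hello, world 123"

# Japanese sample with no ASCII at all: words separated by U+3000,
# line ended with U+2028. Escapes are used so the source stays ASCII.
japanese = (
    "\u65e5\u672c\u8a9e\u3000\u30c6\u30ad\u30b9\u30c8\u3000\u3067\u3059\u2028"
)

print(guess_utf16_byte_order(latin.encode("utf-16-be")))
print(guess_utf16_byte_order(latin.encode("utf-16-le")))
print(guess_utf16_byte_order(japanese.encode("utf-16-be")))
print(guess_utf16_byte_order(japanese.encode("utf-16-le")))
```

    For the Japanese sample the only zero bytes come from U+3000, whose
    low byte is zero, so the guess comes out reversed for both byte
    orders.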

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

