Re: Detecting encoding in Plain text

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Jan 12 2004 - 07:14:09 EST

Next message: Philippe Verdy: "Re: Detecting encoding in Plain text"

Previous message: jon@hackcraft.net: "RE: Detecting encoding in Plain text"
In reply to: Marco Cimarosti: "RE: Detecting encoding in Plain text"
Next in thread: Philippe Verdy: "Re: Detecting encoding in Plain text"
Reply: Philippe Verdy: "Re: Detecting encoding in Plain text"

On 12/01/2004 03:09, Marco Cimarosti wrote:

> ...
>
>It is extremely unlikely that a text file encoded in any single- or
>multi-byte encoding (including UTF-8) would contain a zero byte, so the
>presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
>UTF-32.
>
Is it not dangerous to assume that U+0000 is not used? It is a valid
character and is commonly used, e.g. as a string terminator. Perhaps it
should not appear in truly plain text, but it is likely to occur in
files that are basically text yet include certain kinds of markup.
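
To make the concern concrete, here is a minimal Python sketch. The
function name has_zero_byte and the sample strings are illustrative
assumptions, not anything from the quoted message. It applies the
"any zero byte" test to a UTF-8 record that uses U+0000 as a field
terminator, and to ordinary UTF-16 text.

```python
def has_zero_byte(data: bytes) -> bool:
    """Heuristic under discussion: a 0x00 byte hints at UTF-16 or UTF-32."""
    return b"\x00" in data


# Hypothetical record: two fields separated by U+0000, encoded as UTF-8.
# U+0000 is encoded in UTF-8 as the single byte 0x00.
record = "name\u0000value"
utf8_bytes = record.encode("utf-8")

# Ordinary text encoded as UTF-16 (little-endian, no BOM).
utf16_bytes = "name value".encode("utf-16-le")

print(has_zero_byte(utf8_bytes))   # True, although the data is UTF-8
print(has_zero_byte(utf16_bytes))  # True, as the heuristic expects
```

Both inputs pass the test, so on its own the presence of a zero byte
cannot separate UTF-16 from 8-bit text that happens to contain U+0000.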

>... This is due to the fact that, in any language, shared characters in
>the Latin-1 range (controls, space, digits, punctuation, etc.) should be
>more frequent than occasional code points of form <U+??00>. ...
>
This one also looks dangerous. Some scripts have their own digits and
punctuation; not all scripts use spaces; and controls are not
necessarily present if U+2028 LINE SEPARATOR is used for line breaks.
Moreover, some characters of the form U+??00 may be used quite commonly
in a particular script, and so occur commonly in some text files.
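
A small Python sketch of such a case follows. It assumes the heuristic
compares how often a zero byte appears in the high and low halves of
each 16-bit unit; the quoted fragment is elided, so that reading is an
assumption. The sample text and the choice of U+3000 IDEOGRAPHIC SPACE
as the U+??00 character are also mine, purely for illustration.

```python
def zero_byte_positions(data: bytes) -> tuple[int, int]:
    """Count 0x00 bytes at even and at odd offsets of a 16-bit unit stream."""
    even = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
    odd = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
    return even, odd


# Illustrative Japanese text with no characters in the Latin-1 range:
# words are separated by U+3000 IDEOGRAPHIC SPACE and the two lines by
# U+2028 LINE SEPARATOR, so there are no ASCII spaces or controls.
text = (
    "\u65e5\u672c\u8a9e\u3000\u30c6\u30ad\u30b9\u30c8"  # first line
    "\u2028"                                            # LINE SEPARATOR
    "\u4e8c\u884c\u76ee\u3000\u3067\u3059"              # second line
)
data = text.encode("utf-16-be")

even, odd = zero_byte_positions(data)
print(even, odd)  # 0 2
```

The data is big-endian, yet every zero byte sits in the low half of a
code unit (from U+3000). A test that expects Latin-1 range characters
to dominate would therefore lean towards little-endian here.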

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 07:52:33 EST