RE: Detecting encoding in Plain text

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jan 13 2004 - 06:00:50 EST

Next message: Chris Jacobs: "Re: Chinese rod numerals"

Previous message: Marco Cimarosti: "RE: Detecting encoding in Plain text"
Maybe in reply to: Brijesh Sharma: "Detecting encoding in Plain text"
Next in thread: Marco Cimarosti: "RE: Detecting encoding in Plain text"

Jon Hanna wrote:
> False positives can be caused by the use of U+0000 (which is
> most often encoded as 0x00) which some applications do use
> in text files.

I have never seen such a thing. Can you give an example?

I can't imagine any use for a NULL in a file apart from terminating records
or strings but, of course, a file containing records or strings is not what
I would call a "plain-text file", or at least not a "typical" plain-text file.

> The method can be used reliably with text files that are
> guaranteed to contain large amounts of Latin-1

But the Latin-1 (or even just ASCII) range contains some characters which
are shared by most languages (space, carriage return and/or line feed,
digits, punctuation), so there should be a relatively large number of
Latin-1 characters in most cases.
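
To make the discussion concrete, here is a minimal sketch of the kind of
heuristic being debated, assuming the method in question guesses UTF-16 byte
order from where the 0x00 bytes fall. The function name
guess_utf16_byte_order and the 0.3 threshold are illustrative assumptions,
not something specified in this thread.

```python
def guess_utf16_byte_order(data: bytes, threshold: float = 0.3):
    """Guess UTF-16 byte order from the positions of 0x00 bytes.

    Hypothetical helper: returns "utf-16-be", "utf-16-le", or None when
    the evidence is too weak. The threshold value is an assumption.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None

    # A character in U+0000..U+00FF has 0x00 as its high byte in UTF-16.
    zeros_even = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    zeros_odd = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    even_ratio = zeros_even / pairs
    odd_ratio = zeros_odd / pairs

    # Big-endian stores the high byte first (even offsets),
    # little-endian stores it second (odd offsets).
    if even_ratio >= threshold and odd_ratio < even_ratio / 2:
        return "utf-16-be"
    if odd_ratio >= threshold and even_ratio < odd_ratio / 2:
        return "utf-16-le"
    return None
```

With this sketch, a short Arabic phrase containing spaces and European
digits, such as "مرحبا 2004 بالعالم", still yields enough zero high bytes to
be classified, while a buffer where zeros appear equally at both offsets
(the U+0000 case Jon mentions) is reported as undecided rather than guessed.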

Even scripts which have their own digits or punctuation often prefer
European digits and punctuation, especially in computer usage. For example,
a look at a few websites (or even printed matter) in Arabic shows that
European digits are much more widespread than native digits.

_ Marco


This archive was generated by hypermail 2.1.5 : Tue Jan 13 2004 - 06:40:09 EST