From: Jukka K. Korpela (firstname.lastname@example.org)
Date: Wed Aug 10 2005 - 02:55:39 CDT
On Wed, 10 Aug 2005, Ritesh wrote:
> Issue is like. We have one application where user can upload a file in
> tab delimited or xls file.
Can you ensure that the encoding of character data is one of those that
you list down? More importantly, does the software that is used for the
upload include information about the encoding? It is generally impossible
to deduce the encoding from the data itself, though rather often one
can apply some heuristics (that's Greek for guessing wrong, or accidentally
right at times :-)).
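One common heuristic of that kind, as a rough sketch only: try a strict UTF-8 decode first, because text in a single-byte encoding such as windows-1256 rarely happens to form valid UTF-8 sequences. The function name guess_decode and the cp1256 fallback are assumptions made for illustration; the guess can still be wrong.

```python
from typing import Tuple


def guess_decode(data: bytes, fallback: str = "cp1256") -> Tuple[str, str]:
    """Return (text, encoding_used) for raw uploaded bytes.

    Heuristic only: bytes that form valid UTF-8 are assumed to be UTF-8.
    """
    try:
        # "utf-8-sig" accepts data with or without a leading BOM and
        # raises UnicodeDecodeError on any malformed byte sequence.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback
        # encoding. Nothing in the data itself confirms this.
        return data.decode(fallback, errors="replace"), fallback
```

Pure ASCII data passes the UTF-8 test as well, which is harmless here since both encodings agree on that range.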
> Now, issue is when we read an UTF-8 file without BOM characters, it
> will be treated as non Unicode file and try to read file using cp1256
> (Arabic) encoding. And this garbles the arabic characters being read.
Generally, byte order marks tell the byte order _for Unicode encodings_.
They cannot be reliably used to determine whether a data file is,
in fact, in a Unicode encoding (though the previous remark on heuristics
applies).
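To illustrate what a byte order mark can and cannot tell you, here is a minimal sketch using the BOM constants from Python's standard codecs module. The helper name sniff_bom is hypothetical. A match suggests a Unicode encoding; a result of None says nothing at all about the encoding.

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: the data may still be UTF-8, windows-1256, or anything else.
    return None
```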
Unfortunately, if you use e.g. an HTML form with a file field, then
browsers, upon submitting the form data (including the contents of a
file), do not include information about the encoding into the form data.
For data received via text input fields, you can use a hidden field with
some suitable test characters to recognize the encoding, but file input
seems to be performed in too straightforward a manner, as raw copying of
bytes from a file.
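The hidden-field trick for text input could be sketched roughly as follows. This assumes the server side can get at the raw bytes submitted for the hidden field; the probe string, the candidate list, and the name encoding_from_probe are all made up for illustration.

```python
from typing import Iterable, Optional

# Hypothetical test characters placed in the hidden field:
# e-acute, euro sign, Arabic letter sheen.
PROBE = "\u00e9\u20ac\u0634"


def encoding_from_probe(
    received: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1256", "cp1252"),
) -> Optional[str]:
    """Return the first candidate encoding that decodes the bytes to PROBE."""
    for name in candidates:
        try:
            if received.decode(name) == PROBE:
                return name
        except UnicodeDecodeError:
            # Bytes are not valid in this encoding; try the next one.
            continue
    return None
```

Text fields in the same submission can then be decoded with the encoding found. The bytes of an uploaded file are not covered by this, since they are copied as such.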
Without knowing the specifics of the application, I cannot suggest any
better method than prompting the user for information on encoding, with
some guidelines on finding out what the encoding of his or her files is.
This means bothering the user with technical issues, but what else can you
do if the software does not do its job?
By the way, the registered (at IANA) name of the Windows/Arabic encoding
is windows-1256, whereas cp1256 (or cp-1256) is just a commonly used
alternative name.
> We can achieve this by changing our logic to read it as UTF8
I'm not sure I can follow. How would it work to read it as UTF-8, when it
can in fact be in some other encoding?
> but we
> are wondering if there is no difference between UTF8 and cp1256 then
> why the arabic characters garbled.
There's quite a difference! They are two distinct encodings, though they
happen to encode characters in the ASCII range the same way. The
repertoire of characters in windows-1256 is a subset of the characters
representable in UTF-8, but the representation is quite different.
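A small demonstration of the difference, using Python's built-in codecs (where the windows-1256 codec goes by the name cp1256). The sample word is an arbitrary choice for illustration.

```python
text = "\u0633\u0644\u0627\u0645"  # four Arabic letters (the word "salaam")

as_utf8 = text.encode("utf-8")    # two bytes per Arabic letter
as_1256 = text.encode("cp1256")   # one byte per Arabic letter

print(as_utf8.hex(" "))   # d8 b3 d9 84 d8 a7 d9 85
print(as_1256.hex(" "))   # d3 e1 c7 e3

# Characters in the ASCII range are encoded identically in both.
print("abc".encode("utf-8") == "abc".encode("cp1256"))   # True

# Reading UTF-8 bytes as windows-1256 raises no error; it silently
# yields eight wrong characters instead of the original four.
garbled = as_utf8.decode("cp1256")
print(len(text), len(garbled))   # 4 8
```

This is the garbling described in the question: every byte of the UTF-8 data is a valid windows-1256 character on its own, so nothing fails, and the result is simply the wrong text.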
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/