Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Aug 10 2005 - 02:55:39 CDT

    On Wed, 10 Aug 2005, Ritesh wrote:

    > Issue is like. We have one application where user can upload a file in
    > tab delimited or xls file.

    Can you ensure that the encoding of the character data is one of those
    that you list? More importantly, does the software used for the upload
    include information about the encoding? It is generally impossible
    to deduce the encoding from the data itself, though rather often one
    can apply some heuristics (that's Greek for guessing wrong, or
    accidentally right at times :-))
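
    As an illustration of such a heuristic, here is a minimal sketch in
    Python. It assumes the only candidates are UTF-8 and windows-1256, and
    the function name guess_decode is made up for this example. It relies on
    the fact that text in a legacy 8-bit encoding rarely happens to be
    well-formed UTF-8; it can still guess wrong.

```python
def guess_decode(data, fallback="cp1256"):
    """Return (text, encoding_used). Hypothetical helper, a guess and not a proof."""
    try:
        # Strict decoding fails on byte sequences that are not well-formed UTF-8.
        # "utf-8-sig" also drops a leading UTF-8 BOM if one is present.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy code page (an assumption!).
        return data.decode(fallback, errors="replace"), fallback


sample = "مرحبا"
print(guess_decode(sample.encode("utf-8"))[1])
print(guess_decode(sample.encode("cp1256"))[1])
```

    Note that pure ASCII data is reported as UTF-8 here, which is harmless
    since both encodings agree on that range.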

    > Now, issue is when we read a UTF-8 file without BOM characters, it
    > will be treated as non Unicode file and try to read file using cp1256
    > (Arabic) encoding. And this garbles the arabic characters being read.

    Generally, byte order marks tell the byte order _for Unicode encodings_.
    They cannot be reliably used to determine whether a data file is,
    in fact, in a Unicode encoding (though the previous remark on heuristics
    may apply).
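
    To make the point concrete, a small Python sketch of BOM sniffing
    follows. The name sniff_bom is hypothetical. When it finds no BOM, that
    tells you nothing: UTF-8 files are often written without one, and the
    BOM byte values are themselves legal bytes in windows-1256.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: the data may still be UTF-8, or any other encoding.
    return None, 0


print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
print(sniff_bom("ب".encode("utf-8")))
```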

    Unfortunately, if you use e.g. an HTML form with a file field, then
    browsers, upon submitting the form data (including the contents of a
    file), do not include information about the encoding in the form data.
    For data received via text input fields, you can use a hidden field with
    some suitable test characters to recognize the encoding, but file input
    seems to be performed in too straightforward a manner, as raw copying of
    bytes from a file.
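
    The hidden-field trick for text fields could be checked on the server
    roughly as follows. This is only a sketch in Python: the test string,
    the candidate list and the name identify_encoding are all assumptions
    made for illustration, and it does nothing for uploaded file contents.

```python
def identify_encoding(raw, expected, candidates):
    """Return the first candidate that decodes raw to the expected test string."""
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                return name
        except UnicodeDecodeError:
            # These bytes are not valid in this candidate encoding.
            continue
    return None


# Assumed test characters: an Arabic letter and the euro sign.
TEST_CHARS = "ب€"
received = TEST_CHARS.encode("cp1256")  # what a windows-1256 page might submit
print(identify_encoding(received, TEST_CHARS, ["utf-8", "cp1256"]))
```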

    Without knowing the specifics of the application, I cannot suggest any
    better method than prompting the user for information on the encoding,
    with some guidelines on finding out what the encoding of his or her
    files is. This means bothering the user with technical issues, but what
    else can you do if the software does not do its job?

    By the way, the registered (at IANA) name of the Windows/Arabic encoding
    is windows-1256, whereas cp1256 (or cp-1256) is just a commonly used
    unregistered synonym.
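
    Software often accepts both names. As one example, assuming Python's
    standard codecs module, both spellings resolve to the same codec:

```python
import codecs

registered = codecs.lookup("windows-1256")  # the IANA-registered name
synonym = codecs.lookup("cp1256")           # the common unregistered synonym

# Both lookups report the same canonical codec name.
print(registered.name, synonym.name)
same_codec = registered.name == synonym.name
```

    When labelling data for others (e.g. in HTTP headers or HTML), the
    registered name is the safer choice.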

    > We can achieve this by changing our logic to read it as UTF8

    I'm not sure I can follow. How would it work to read it as UTF-8, when it
    can in fact be in some other encoding?

    > but we
    > are wondering if there is no difference between UTF8 and cp1256 then
    > why the arabic characters garbled.

    There's quite a difference! They are two distinct encodings, though they
    happen to encode characters in the ASCII range the same way. The
    repertoire of characters in windows-1256 is a subset of the characters
    representable in UTF-8, but the representation is quite different.
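
    A short Python sketch of the difference, using the Arabic letter beh
    (U+0628) as an arbitrary example character: it is one byte in
    windows-1256 but two bytes in UTF-8, so reading UTF-8 bytes as
    windows-1256 yields two wrong characters instead of one right one.

```python
text = "ب"  # ARABIC LETTER BEH, U+0628

as_1256 = text.encode("cp1256")
as_utf8 = text.encode("utf-8")
print(as_1256.hex(), as_utf8.hex())  # c8 d8a8

# Misreading the UTF-8 bytes as windows-1256 gives garbage, not an error.
garbled = as_utf8.decode("cp1256")
print(len(text), len(garbled))

# In the ASCII range the two encodings produce identical bytes.
ascii_same = "tab\tdelimited".encode("cp1256") == "tab\tdelimited".encode("utf-8")
```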

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

