Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Ritesh (ritesh.h.patel@gmail.com)
Date: Wed Aug 10 2005 - 12:34:48 CDT


    Hi,

    Thanks everyone for the detailed explanation. We are planning to go
    with the approach mentioned by Philippe.

    We had hoped there might be some other approach that would be more
    helpful, but it seems we need to proceed as described in the mail below.

    We tried this approach and it is working fine. All of this will be
    really helpful in explaining the situation to our customer.

    Thanks again,
    Ritesh

    On 8/10/05, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    > From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
    > > Ritesh, on Wed 10 Aug 2005 12:33:21 +0530, wrote:
    > >> Now we have a few users who upload a file which can contain English
    > >> and other-language characters (here it is Arabic).
    > >
    > > Doesn't the browser tell the charset of the uploaded file?
    >
    > Typically no, not if you are just uploading a plain-text file stored in
    > your filesystem. The browser will just figure out the MIME type of the
    > file from the filesystem properties (basically the file extension, which
    > for plain-text files is typically ".txt" and maps to the "text/plain"
    > MIME type without any charset indication), without even trying to parse
    > its content to see whether there is a charset "indicator" in the text
    > file.
    >
    > What you need, Ritesh, is a way to distinguish a BOM-less UTF-8 text file
    > from a CP1256 or ISO-Arabic (ISO 8859-6) text file. For that you'll need
    > a heuristic, because there is no exact algorithm: the detected charset
    > will not always be the right answer.
    >
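    A minimal sketch in Python of the overall order described below; the
    function names are assumptions, and the helpers are sketched after each
    step.

        # Hypothetical driver tying the steps together; detect_bom,
        # decode_unicode, decode_arabic and looks_like_plain_text are the
        # helper sketches shown after each step below.
        def detect_and_decode(data: bytes):
            enc = detect_bom(data)             # step 1: a leading BOM?
            if enc:
                return data.decode(enc), enc
            decoded = decode_unicode(data)     # step 2: BOM-less UTF-8/UTF-16?
            if decoded:
                return decoded
            return decode_arabic(data)         # step 3: CP1256, then ISO-8859-6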
    > Typically, you can first check whether the file has a leading BOM. If
    > there's a UTF-8, UTF-16BE or UTF-16LE leading BOM, you can be nearly
    > sure that the encoding is correct, because none of these encoded BOMs
    > looks like the beginning of an ISO-Arabic or CP1256 text file.
    >
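    A minimal sketch of that BOM check in Python (the function name and the
    idea of working on the raw uploaded bytes are assumptions):

        def detect_bom(data: bytes):
            """Return a codec name implied by a leading BOM, or None."""
            if data.startswith(b'\xef\xbb\xbf'):
                return 'utf-8-sig'   # UTF-8 BOM; this codec strips it on decode
            if data.startswith((b'\xff\xfe', b'\xfe\xff')):
                return 'utf-16'      # UTF-16 BOM, either endianness; consumed on decode
            return None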
    > Then you'll have to check whether it is a BOM-less UTF-8 or UTF-16 file:
    > try decoding the file completely, and if that succeeds, the file is most
    > probably in that encoding (the chance that the answer is wrong is
    > extremely low, notably if the file is long enough and contains enough
    > human language rather than just a collection of symbols and digits with
    > only a few Arabic characters).
    >
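    A sketch of that check in Python (the function name is an assumption). A
    strict UTF-8 decode is a strong signal; a BOM-less UTF-16 check is weaker,
    because many byte sequences happen to decode as UTF-16, so it helps to add
    a hint such as a high proportion of NUL bytes:

        def decode_unicode(data: bytes):
            """Try strict BOM-less UTF-8, then UTF-16; return (text, encoding) or None."""
            try:
                return data.decode('utf-8'), 'utf-8'
            except UnicodeDecodeError:
                pass
            # Heuristic hint: UTF-16 text that contains ASCII has many NUL bytes.
            if data and data.count(0) > len(data) // 3:
                for enc in ('utf-16-le', 'utf-16-be'):
                    try:
                        return data.decode(enc), enc
                    except UnicodeDecodeError:
                        pass
            return None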
    > If that fails too, decode it with CP1256. This may fail if the decoder
    > treats some bytes in the 0x80-0x9F range as unmapped (though most
    > Windows-1256 tables assign all of them). If this happens, you may then
    > attempt to decode with ISO-Arabic (ISO 8859-6); bytes in the 0x80-0x9F
    > range will then decode to C1 control characters, which are typically not
    > part of plain text, and a strict decoder may also reject the byte values
    > that ISO 8859-6 leaves unassigned.
    >
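    A sketch of that fallback chain in Python (the function name is an
    assumption; in practice the CP1256 decode will usually succeed, so the
    second branch is rarely reached):

        def decode_arabic(data: bytes):
            """Fall back to CP1256, then ISO-8859-6 (ISO Arabic)."""
            try:
                return data.decode('cp1256'), 'cp1256'
            except UnicodeDecodeError:
                pass
            try:
                return data.decode('iso-8859-6'), 'iso-8859-6'
            except UnicodeDecodeError:
                # A strict ISO-8859-6 decoder rejects the byte values the
                # charset leaves unassigned; give up here (or retry with
                # errors='replace' if lossy decoding is acceptable).
                return None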
    > Once you have determined the charset this way, the decoded file will
    > contain only valid Unicode characters. You'll still have to check its
    > content to see whether control characters are acceptable for your
    > application. If they are not, the file is invalid for your application,
    > and it is probably not plain text if it contains C1 controls, or C0
    > controls other than {CR, LF, TAB, FF, SUB}.
    >
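    A sketch of that plain-text sanity check in Python (the allowed set simply
    mirrors the list above and is an assumption; adjust it for your
    application):

        # CR, LF, TAB, FF and SUB (Ctrl+Z, the old DOS end-of-file marker).
        ALLOWED_CONTROLS = {'\r', '\n', '\t', '\x0c', '\x1a'}

        def looks_like_plain_text(text: str) -> bool:
            """Reject decoded text containing other C0/C1 control characters."""
            for ch in text:
                code = ord(ch)
                is_control = code < 0x20 or 0x80 <= code <= 0x9f
                if is_control and ch not in ALLOWED_CONTROLS:
                    return False
            return True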
    > Note that the SUB control character (Ctrl+Z, the DOS end-of-file marker)
    > may be present in plain text edited with MS-DOS tools. Its presence may
    > indicate that the text file was in fact not encoded with CP1256 or
    > ISO-Arabic but with a DOS Arabic codepage, so you may then try decoding
    > it again with that DOS codepage (DOS codepages map most byte values, so
    > such a decode will usually succeed). This control character is normally
    > valid only at the very end of the file, where it can simply be ignored.
    > It may also survive transcoding: if the original file was edited under
    > DOS and converted to UTF-8 or UTF-16 at some point in the past, the
    > marker may still be present, and it should be ignored in that case as
    > well.
    >
    >
    >
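    A sketch of that retry in Python. CP864 is used here as the DOS Arabic
    codepage; that choice is an assumption (CP720 is another DOS Arabic
    codepage), so pick whichever matches where your files come from:

        DOS_EOF = '\x1a'   # SUB / Ctrl+Z, the DOS end-of-file marker

        def retry_dos_codepage(data: bytes, text: str) -> str:
            """If the decoded text contains a DOS EOF marker, try re-decoding
            the raw bytes with a DOS Arabic codepage."""
            if DOS_EOF in text:
                try:
                    text = data.decode('cp864')   # assumed codepage; CP720 also exists
                except UnicodeDecodeError:
                    pass                          # keep the original decoding
            return text.replace(DOS_EOF, '')      # the marker itself carries no text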


