Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 12:50:35 CDT


    Anyway, what I gave is not an algorithm, but a reasonably good heuristic.
    Because this heuristic tests charsets in a priority order, one thing you
    can rely on is that it will never fail if the user sends you a file
    encoded with a BOM in UTF-8 or UTF-16.
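
    To make that order concrete, here is a rough sketch in Python (the codec
    names follow Python's codec registry; this illustrates the heuristic, it
    is not a definitive detector):

        import codecs

        def detect_charset(data: bytes) -> str:
            """Priority-order charset heuristic (illustrative sketch)."""
            # 1. A BOM is decisive: no CP1256 or ISO-Arabic file starts
            #    with any of these byte sequences.
            if data.startswith(codecs.BOM_UTF8):
                return "utf-8-sig"   # decoder strips the BOM
            if data.startswith(codecs.BOM_UTF16_BE) or \
               data.startswith(codecs.BOM_UTF16_LE):
                return "utf-16"      # decoder reads the BOM for endianness
            # 2. BOM-less UTF-8: a strict decode of legacy Arabic bytes
            #    almost never succeeds by accident.
            try:
                data.decode("utf-8")
                return "utf-8"
            except UnicodeDecodeError:
                pass
            # 3. Legacy fallbacks, most restrictive first.
            try:
                data.decode("cp1256")
                return "cp1256"
            except UnicodeDecodeError:
                return "iso-8859-6"  # last resort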

    For this reason, you should document that these formats are recommended.
    If you don't want a BOM in UTF-8 files, be ready to receive complaints
    from users who edit their text files on Windows with Notepad,
    Write/WordPad, and many other Windows editors, as these silently add the
    leading BOM, even to UTF-8 files.

    I am not sure you should forbid this BOM in UTF-8 (even if Unicode does
    not recommend its usage), given that you are already lenient in accepting
    other non-Unicode charsets as a convenience for users of legacy
    solutions. Windows is so common that you can expect this BOM to be
    present in almost all UTF-8 files you'll receive. (Note that UTF-16
    files, which Windows saves under the "Unicode" charset name, also include
    this BOM, even if the user explicitly selects "Unicode big-endian";
    Windows editors only generate UTF-16 with a BOM, never bare UTF-16BE or
    UTF-16LE.)

    It is often best practice, when working with multiple encodings, to accept
    and generate text files starting with this BOM, simply because plain-text
    files have no other meta-data to explicitly reveal their encoding.
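
    In Python, for instance, the "utf-8-sig" codec gives exactly this
    accept-and-generate behavior (a small sketch; the file names are
    hypothetical):

        # Reading: "utf-8-sig" strips a leading BOM if present, and is
        # harmless when the file has none.
        with open("upload.txt", "rb") as f:
            text = f.read().decode("utf-8-sig")

        # Writing: the same codec prepends the BOM, which keeps Notepad
        # and other Windows editors happy.
        with open("copy.txt", "w", encoding="utf-8-sig") as f:
            f.write(text)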

    In that case, you could also accept SCSU-encoded files (which also start
    with a BOM). Note that the use of ZWNBSP for anything other than a BOM is
    deprecated and strongly discouraged by Unicode, which has introduced
    another format control (WORD JOINER, U+2060) for all the useful and
    semantically significant cases where ZWNBSP was needed. For this reason,
    if you detect a ZWNBSP character in the middle of the file, your
    application's specification may state that it is silently replaced by the
    new recommended format control (this is not strictly conforming to
    Unicode, but it may be part of the specification of the upper-layer
    protocol in your application, in the same way that natural or programming
    languages impose additional syntactic and semantic requirements on
    characters above the Unicode level).
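
    As a sketch of such an upper-layer rule in Python (the exact replacement
    policy is of course application-specific):

        ZWNBSP = "\ufeff"       # BOM when leading; deprecated elsewhere
        WORD_JOINER = "\u2060"  # the recommended replacement

        def normalize_zwnbsp(text: str) -> str:
            """Keep a leading U+FEFF (it is a BOM); replace any other
            occurrence with WORD JOINER."""
            if text.startswith(ZWNBSP):
                return ZWNBSP + text[1:].replace(ZWNBSP, WORD_JOINER)
            return text.replace(ZWNBSP, WORD_JOINER)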

    ----- Original Message -----
    From: "Ritesh" <ritesh.h.patel@gmail.com>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    Cc: "Samuel Thibault" <samuel.thibault@ens-lyon.org>; "Jon Hanna"
    <jon@hackcraft.net>; <dewell@adelphia.net>; <unicode@unicode.org>
    Sent: Wednesday, August 10, 2005 7:34 PM
    Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

    Hi,

    Thanks everyone for the detailed explanation. We are planning to go with
    the approach mentioned by Philippe.

    We had thought that some other approach might be more helpful, but it
    seems we need to proceed as described in the mail below.

    We tried this approach and it is working fine. All of this will be
    really helpful in making our customer understand.

    Thanks again,
    Ritesh

    On 8/10/05, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
    > From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
    > > Ritesh, on Wed 10 Aug 2005 12:33:21 +0530, wrote:
    > >> Now we have a few users who upload files that can contain English
    > >> and other language characters (here it is Arabic).
    > >
    > > Doesn't the browser tell the charset of the uploaded file?
    >
    > Typically no. Not if you are just uploading a plain-text file
    > initially stored in your filesystem, because the browser will just
    > infer the MIME type of the file from filesystem properties (basically
    > the file extension, which for plain-text files is typically ".txt" and
    > maps to the "text/plain" MIME type without any charset indication),
    > without even trying to parse its content (to see whether there's a
    > charset "indicator" in the text file).
    >
    > What you, Ritesh, need is a way to distinguish a BOM-less UTF-8 text
    > file from a CP1256 or ISO-Arabic text file. For that you'll need a
    > heuristic, because there is no exact algorithm: the detection of the
    > charset will not always return the right answer.
    >
    > Typically, you can first parse the file to detect whether it has a
    > leading BOM. If there's a UTF-8, UTF-16BE, or UTF-16LE leading BOM,
    > you can be nearly sure that the encoding is correct, because none of
    > these encoded BOMs looks like the beginning of an ISO-Arabic or CP1256
    > text file.
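    >
    > A small Python sketch of that first check (these are the standard BOM
    > byte signatures; a UTF-32 BOM is not considered here):
    >
    >     def sniff_bom(data: bytes):
    >         """Return (encoding, bom_length), or (None, 0) if no BOM."""
    >         if data.startswith(b"\xef\xbb\xbf"):
    >             return "utf-8", 3
    >         if data.startswith(b"\xfe\xff"):
    >             return "utf-16-be", 2
    >         if data.startswith(b"\xff\xfe"):
    >             return "utf-16-le", 2
    >         return None, 0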
    >
    > Then you'll have to check whether it is a BOM-less UTF-8 or UTF-16
    > file: try decoding the file completely, and if that succeeds with one
    > of these encodings, the file is most probably encoded that way (the
    > chance of a wrong answer is extremely low, notably if the file is long
    > enough and contains enough human language, rather than a collection of
    > symbols and digits with few Arabic characters).
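    >
    > In Python that test is just a strict decode; for example:
    >
    >     def looks_like_utf8(data: bytes) -> bool:
    >         # Bytes >= 0x80 must form valid multi-byte sequences, which
    >         # CP1256 text practically never does by chance. (A BOM-less
    >         # UTF-16 check is much weaker: nearly any even-length input
    >         # decodes, so it is safer to require the BOM for UTF-16.)
    >         try:
    >             data.decode("utf-8")
    >             return True
    >         except UnicodeDecodeError:
    >             return False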
    >
    > If both of those fail, decode it with CP1256. This may fail if there
    > are bytes in the 0x80-0x9F range that have no character mapping. If
    > that happens, you may then attempt to decode with ISO-Arabic (as
    > commonly implemented it will not fail, because the charset gives an
    > unambiguous single character mapping for each possible byte value;
    > however, you'll get C1 control characters for the 0x80-0x9F range, and
    > those characters are typically not part of plain text).
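    >
    > For example (whether the CP1256 decode can actually fail depends on
    > the implementation; Python's cp1256 codec maps all 256 bytes, so the
    > order of preference is the point of this sketch):
    >
    >     def decode_legacy_arabic(data: bytes):
    >         try:
    >             return data.decode("cp1256"), "cp1256"
    >         except UnicodeDecodeError:
    >             # errors="replace" guards against decoders that leave a
    >             # few ISO 8859-6 positions unassigned; 0x80-0x9F come out
    >             # as C1 controls, rejected by the plain-text check below.
    >             return (data.decode("iso-8859-6", errors="replace"),
    >                     "iso-8859-6")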
    >
    > So when you have finished determining the charset, the decoded file
    > will contain only valid Unicode characters. You'll then have to check
    > its internal syntax to see whether control characters are acceptable
    > for your application. If they are not, then the file is invalid for
    > your application, and probably not plain text if it contains C1
    > controls, or C0 controls other than {CR, LF, TAB, FF, Ctrl-Z}.
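    >
    > A plausible plain-text check along those lines:
    >
    >     ALLOWED = {"\t", "\n", "\r", "\f", "\x1a"}  # TAB LF CR FF Ctrl-Z
    >
    >     def is_plain_text(text: str) -> bool:
    >         for ch in text:
    >             # Reject C0 controls outside the allowed set, plus DEL
    >             # and all C1 controls (U+0080..U+009F).
    >             if ch < " " and ch not in ALLOWED:
    >                 return False
    >             if "\x7f" <= ch <= "\x9f":
    >                 return False
    >         return True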
    >
    > Note that the Ctrl-Z control character (SUB, 0x1A, the DOS end-of-file
    > marker) may be present in plain text edited with MS-DOS tools. Its
    > presence may indicate that the text file was in fact not encoded with
    > CP1256 or ISO-Arabic, but with a DOS Arabic codepage, so you may retry
    > decoding it with that DOS codepage (the decoding will not fail,
    > because such a codepage is complete, like the ISO-Arabic charset).
    > This control character is only valid at the end of the file, where it
    > can simply be ignored. It may also happen that the character was
    > transcoded into UTF-8 or UTF-16, if the original file was edited under
    > DOS and transcoded at some point in the past; in that case too it
    > should be ignored.
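    >
    > A sketch of that special case (cp720 and cp864 are the usual DOS
    > Arabic codepages; which one applies to a given file is a guess):
    >
    >     def maybe_dos_arabic(data: bytes):
    >         # A trailing Ctrl-Z (0x1A) suggests a DOS-era file: retry
    >         # with a DOS Arabic codepage, truncating at the end-of-file
    >         # marker the way DOS readers did.
    >         if b"\x1a" in data:
    >             data = data[: data.index(b"\x1a")]
    >             return data.decode("cp720", errors="replace")  # or cp864
    >         return None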


