Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Ritesh (ritesh.h.patel@gmail.com)
Date: Wed Aug 10 2005 - 02:03:21 CDT

    Hi,

    Thanks a lot for the prompt reply.

    The issue is this: we have an application where a user can upload a
    file, either tab-delimited or xls.

    Some of our users upload files that contain English and other-language
    characters (here, Arabic).

    These files can come in the following combinations:
    1. File is UTF-8 and has English and Arabic characters.
    2. File is UTF-16 (LE) and has English and Arabic characters.
    3. File is UTF-8 and has only Arabic characters.
    4. File is UTF-8 and has only English characters.
    5. File is UTF-16 and has only Arabic characters.
    6. File is UTF-16 and has only English characters.
    7. File can be in ASCII format.

    Once the file is uploaded, it is displayed in the browser.

    We use the following logic while uploading the file.

            // UTF-16 (LE) byte order mark, bytes 0xFF 0xFE (-1, -2 as signed
            // bytes). Used to check whether the first line of the stream
            // being read is UTF-16 (LE) or not.
            byte[] unicodePrefix = new byte[] { (byte) 0xFF, (byte) 0xFE };

            // This is for Arabic; it is retrieved at runtime from a hashmap
            // depending on the user's session language.
            String encoding = "Cp1256";

            String unicodeEncoding = "UnicodeLittleUnmarked";
            
            while (moreToRead)
            {
            
                // This reads the file from an input stream and stores it
                // in an array of bytes, i.e. buffer.
                // The buffer size is 16K.

                bytesRead = FixServletInputStream.readLine(inStream,buffer,0,buffer.length);

                // The following logic checks whether the file is a
                // UTF-16 (LE) file and marks the necessary variables.

                if ( firstTimeFlag )
                {
                    firstTimeFlag = false;

                    byte[] prefix = new byte[unicodePrefix.length];
                    for(int i = 0; i < unicodePrefix.length; i++)
                    {
                      prefix[i] = buffer[i];
                    }

                    isUnicodeFile = true;

                    for(int i = 0; i < unicodePrefix.length; i++)
                    {
                        if(prefix[i] != unicodePrefix[i])
                        {
                            isUnicodeFile = false;
                            break;
                        }
                    }//End of for
                }//End of if (firstTimeFlag)

                if(EOF())
                    moreToRead = false;
                
                else {
                        // We read the line and store each word of the
                        // (tab-delimited) line in a vector.
                        // If the file is a Unicode file, which is set in the
                        // check above, then skip the first two bytes of the
                        // first line, read the rest of the line, and set the
                        // encoding to unicodeEncoding, i.e.
                        // "UnicodeLittleUnmarked". All subsequent lines are
                        // read based on the file type.

                        Vector thisLine;
                        if(isUnicodeFile)
                        {
                            encoding = unicodeEncoding;
                            isUnicodeFile = false;
                            int j = 0;
                            byte[] newBuf = new byte[buffer.length - unicodePrefix.length];

                            for(int i = unicodePrefix.length; i < bytesRead; i++)
                                newBuf[j++] = buffer[i];
                            String tmpStr = new String(newBuf, 0, newBuf.length, encoding);
                            thisLine = tokenize(tmpStr,TAB_DELIMITER);
                        }
                        else
                        {
                            String tmpStr;

                            if(encoding.equals(unicodeEncoding))
                                tmpStr = new String(buffer, 1, bytesRead, encoding);
                            else
                                tmpStr = new String(buffer, 0, bytesRead, encoding);

                            // Runs for both branches above.
                            thisLine = tokenize(tmpStr,TAB_DELIMITER);
                        }
                }
            }
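
    For comparison, a minimal sketch of the same read loop done at the
    character level. It assumes the charset is already known and a current
    JDK; TabFileReader and readRows are made-up names, not part of our
    application. In UTF-16 (LE) a line feed is the two bytes 0x0A 0x00, so
    splitting on bytes leaves the second byte at the start of the next
    line, while a Reader splits on decoded characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class TabFileReader {

    // Decodes the whole stream with the given charset, then splits it into
    // lines and tab-delimited fields after decoding.
    static List<List<String>> readRows(InputStream in, Charset charset) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            boolean firstLine = true;
            String line;
            while ((line = reader.readLine()) != null) {
                // The UTF-8 and UTF-16LE decoders keep a leading byte order
                // mark as the character U+FEFF, so drop it by hand.
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                firstLine = false;
                // A limit of -1 keeps trailing empty fields.
                rows.add(Arrays.asList(line.split("\t", -1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

    With this shape there is no need to copy the buffer or to skip bytes by
    hand on the first and later lines.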

    Now, the issue is that when we read a UTF-8 file without a BOM, it is
    treated as a non-Unicode file and read using the cp1256 (Arabic)
    encoding, and this garbles the Arabic characters being read.

    If we set the encoding to UTF8 instead of cp1256 in " tmpStr = new
    String(buffer, 0, bytesRead, encoding);", it reads and displays the
    Arabic content properly.

    Can you tell us if we are missing anything?
    We can fix this by changing our logic to read the file as UTF8, but we
    are wondering: if there is no difference between UTF8 and cp1256, then
    why are the Arabic characters garbled?
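
    A small sketch that reproduces the garbling outside our application.
    It assumes a JDK that ships the windows-1256 charset; MojibakeDemo and
    misdecode are made-up names. The same Arabic word is stored as
    different bytes in the two encodings, so bytes written as UTF-8 and
    decoded as cp1256 give different characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class MojibakeDemo {

    // Assumes the running JDK provides this charset.
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Encodes the text as UTF-8, then decodes those bytes as cp1256,
    // which is what happens to a BOM-less UTF-8 upload in our loop.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, CP1256);
    }

    public static void main(String[] args) {
        String word = "\u0633\u0644\u0627\u0645"; // four Arabic letters

        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        byte[] cp = word.getBytes(CP1256);

        System.out.println(utf8.length); // 8, two bytes per letter
        System.out.println(cp.length);   // 4, one byte per letter

        // Wrong charset: every UTF-8 byte becomes its own character.
        System.out.println(misdecode(word));

        // Matching charset: the original word comes back.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```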

    I would like to add that the customer is instructed to upload a UTF-8
    file without a BOM, so we do not have any special check for the UTF-8
    BOM.

    After hitting the problem, we tried additional logic that compares the
    first three bytes against the UTF-8 BOM and then reads the stream the
    same way as above, i.e. using cp1256 as the encoding. But that also
    fails.
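
    One possible detection order, as a sketch only and not what our
    application does today: check for a BOM, then try a strict UTF-8
    decode, and fall back to the session encoding only when that fails.
    EncodingSniffer, detect and startsWith are made-up names. It assumes
    the whole upload is in memory, because a 16K buffer can end in the
    middle of a multi-byte sequence and would then be reported as
    malformed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class EncodingSniffer {

    // Returns the charset to decode with. The caller still has to skip
    // the BOM bytes (or the decoded U+FEFF character) when one is present.
    static Charset detect(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for invalid input.
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8; // also covers plain ASCII
        } catch (CharacterCodingException e) {
            return fallback; // e.g. Cp1256 from the session language
        }
    }

    // Compares the leading bytes as unsigned values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

    Plain ASCII is valid UTF-8 and decodes to the same characters, so
    reporting UTF-8 for it is harmless. Arabic text saved as cp1256 will
    normally fail the strict UTF-8 decode and end up on the fallback.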

    Just to reiterate: it works fine if we use UTF-8 as the encoding when
    creating the string, " tmpStr = new String(buffer, 0, bytesRead, "UTF8");",
    but the text is garbled with cp1256.

    Thanks,
    Ritesh

    On 8/9/05, Jon Hanna <jon@hackcraft.net> wrote:
    > Ritesh wrote:
    > > Is there any list which can give the details of Character(s) which
    > > exist in Cp1256, but not supported by UTF-8.
    >
    > There are none.
    >
    > > We are facing some
    > > encoding issues in our application.
    > >
    > > Any pointers to this will be highly appreciated.
    >
    > What's the issue?
    >
    >

    -- 
    Ritesh H. Patel
    Sr. Application Engineer,
    Oracle India Private Limited.
    CG-NW1-2A005, 2nd Floor, Cyber Gateway, Cyberabad, Hyderabad 500 081
    Andhra Pradesh
    91-40-98496 48756 (M) 
    91-40-55392569 (O - Direct Line)
    