Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Ritesh (ritesh.h.patel@gmail.com)
Date: Wed Aug 10 2005 - 02:03:21 CDT

Next message: Samuel Thibault: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Previous message: Raymond Mercier: "unset unicode vacation"
In reply to: Jon Hanna: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Samuel Thibault: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Samuel Thibault: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Jukka K. Korpela: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: eflarup@yahoo.com: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Theo Veenker: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Hi,

Thanks a lot for the prompt reply.

The issue is this. We have an application where a user can upload a
file in tab-delimited or XLS format.

We have a few users who upload files that can contain English and
other-language characters (here, Arabic).

These files can come in the following combinations:
1. The file is UTF-8 and has English and Arabic characters.
2. The file is UTF-16 (LE) and has English and Arabic characters.
3. The file is UTF-8 and has only Arabic characters.
4. The file is UTF-8 and has only English characters.
5. The file is UTF-16 and has only Arabic characters.
6. The file is UTF-16 and has only English characters.
7. The file can be in ASCII format.

Once the file is uploaded, it is displayed in the browser.

We are using the following logic while uploading the file.

        // Used to check whether the first line of the stream being read
        // starts with the UTF-16 (LE) BOM, i.e. the bytes 0xFF 0xFE.
        byte[] unicodePrefix = new byte[] { -1, -2 };

        // This is for Arabic; it is retrieved at runtime from a HashMap
        // depending on the user's session language.
        String encoding = "Cp1256";

        String unicodeEncoding = "UnicodeLittleUnmarked";

        while (moreToRead)
        {

            // This reads a line of the file from an input stream and
            // stores it in an array of bytes, i.e. buffer.
            // The buffer size is 16K.

            bytesRead =
                FixServletInputStream.readLine(inStream, buffer, 0, buffer.length);

            // The following logic checks whether the file is a UTF-16 (LE)
            // file and marks the necessary variables.

            if ( firstTimeFlag )
            {
                firstTimeFlag = false;

                byte[] prefix = new byte[unicodePrefix.length];
                for(int i = 0; i < unicodePrefix.length; i++)
                {
                  prefix[i] = buffer[i];
                }

                isUnicodeFile = true;

                for(int i = 0; i < unicodePrefix.length; i++)
                {
                    if(prefix[i] != unicodePrefix[i])
                    {
                        isUnicodeFile = false;
                        break;
                    }
                }//End of for
            }//End of if (firstTimeFlag)

            if(EOF())
                moreToRead = false;

            else {
                    // We read the line and store each word of the
                    // (tab-delimited) line in a Vector.
                    // If the file is a Unicode file, as determined in the
                    // check above, then skip the first two bytes of the
                    // first line, read the rest of the line, and set the
                    // encoding to unicodeEncoding,
                    // i.e. "UnicodeLittleUnmarked".
                    // All subsequent lines are read based on the file type.

                    Vector thisLine;
                    if(isUnicodeFile)
                    {
                        encoding = unicodeEncoding;
                        isUnicodeFile = false;
                        int j = 0;
                        byte[] newBuf = new byte[buffer.length -
                                                 unicodePrefix.length];

                        for(int i = unicodePrefix.length; i < bytesRead; i++)
                            newBuf[j++] = buffer[i];
                        // Decode only the j bytes actually copied, not the
                        // unused (zero-filled) tail of newBuf.
                        String tmpStr = new String(newBuf, 0, j, encoding);
                        thisLine = tokenize(tmpStr,TAB_DELIMITER);
                    }
                    else
                    {
                        String tmpStr;

                        if(encoding.equals(unicodeEncoding))
                            tmpStr = new String(buffer, 1, bytesRead, encoding);
                        else
                            tmpStr = new String(buffer, 0, bytesRead, encoding);
                        thisLine = tokenize(tmpStr,TAB_DELIMITER);
                    }
            }
        }

The issue is that when we read a UTF-8 file without BOM characters, it
is treated as a non-Unicode file and the code tries to read it using
the Cp1256 (Arabic) encoding. This garbles the Arabic characters being
read.
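
The garbling can be reproduced in isolation. The sketch below is not
part of our application; the class and method names (MisdecodeDemo,
misdecode) are made up for illustration, and it assumes a JDK that has
java.nio.charset.StandardCharsets and ships the windows-1256 charset
(the canonical name for Cp1256). It takes the Arabic letter BEH
(U+0628), which is the single byte 0xC8 in Cp1256 but the two bytes
0xD8 0xA8 in UTF-8, and decodes the UTF-8 bytes as Cp1256.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // "windows-1256" is the canonical java.nio name for Cp1256. Assumption:
    // the JDK in use ships this charset (standard JDK builds do).
    static final Charset CP1256 = Charset.forName("windows-1256");

    // Decodes the UTF-8 bytes of text as if they were Cp1256 bytes,
    // which is what happens to a BOM-less UTF-8 upload in the logic above.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1256);
    }

    public static void main(String[] args) {
        String beh = "\u0628"; // ARABIC LETTER BEH
        // Number of bytes for the same character in each encoding.
        System.out.println(beh.getBytes(CP1256).length);
        System.out.println(beh.getBytes(StandardCharsets.UTF_8).length);
        // Each UTF-8 byte becomes its own (wrong) character under Cp1256.
        System.out.println(misdecode(beh).length());
        // Pure ASCII text is unaffected, since both encodings agree there.
        System.out.println(misdecode("abc").equals("abc"));
    }
}
```

Both encodings can represent the character, but they use different
byte sequences for it, so bytes written in one cannot be read back
with the other. Only the ASCII range is the same in both.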

If we set the encoding to UTF8 instead of Cp1256 for " tmpStr = new
String(buffer, 0, bytesRead, encoding);" it reads and displays the
Arabic content properly.

Can you tell us if we are missing anything?
We can achieve this by changing our logic to read the file as UTF8, but
we are wondering: if there is no difference between UTF8 and Cp1256,
then why are the Arabic characters garbled?

I would like to add that the customer is instructed to upload a UTF-8
file without a BOM. So, we do not have any special check for the UTF-8
BOM.

After hitting the problem, we tried to implement additional logic where
we compare the first three bytes against the UTF-8 BOM and then read
the stream the same way as above, i.e. using Cp1256 as the encoding.
But that also fails.

Just to reiterate, it works fine if we use UTF-8 as the encoding while
creating the string " tmpStr = new String(buffer, 0, bytesRead, "UTF8");"
but garbles for Cp1256.
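
One possible direction, given here only as a rough sketch and not as
what our application does today: choose the decoder from the bytes
themselves. Check for the UTF-16 (LE) and UTF-8 BOMs, and if neither is
present, try a strict UTF-8 decode and fall back to the session's
legacy encoding (Cp1256 here) only when the bytes are not valid UTF-8.
The class and method names (UploadDecoder, decode, startsWith) are
hypothetical. The sketch assumes the whole upload is already in one
byte array rather than being read line by line, and a JDK that has
java.nio.charset.StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UploadDecoder {
    // Decodes data using a BOM if one is present, otherwise strict UTF-8,
    // otherwise the caller-supplied legacy charset (e.g. windows-1256).
    static String decode(byte[] data, Charset legacy) {
        if (startsWith(data, 0xFF, 0xFE)) {          // UTF-16 (LE) BOM
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {    // UTF-8 BOM
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy single-byte encoding.
            return new String(data, legacy);
        }
    }

    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The fallback is a heuristic: a Cp1256 file whose bytes happen to form
valid UTF-8 would be misread, although for real Arabic text that is
unlikely. Pure ASCII files decode the same way on every branch.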

Thanks,
Ritesh

On 8/9/05, Jon Hanna <jon@hackcraft.net> wrote:
> Ritesh wrote:
> > Is there any list which can give the details of Character(s) which
> > exist in Cp1256, but not supported by UTF-8.
>
> There are none.
>
> > We are facing some
> > encoding issues in our application.
> >
> > Any pointers to this will be highly appreciated.
>
> What's the issue?
>
>

-- 
Ritesh H. Patel
Sr. Application Engineer,
Oracle India Private Limited.
CG-NW1-2A005, 2nd Floor, Cyber Gateway, Cyberabad, Hyderabad 500 081
Andra Pradesh 
91-40-98496 48756 (M) 
91-40-55392569 (O - Direct Line)


This archive was generated by hypermail 2.1.5 : Wed Aug 10 2005 - 02:04:51 CDT